Document retrieval is one of the best established information retrieval activities since the sixties,
pervading all search engines. Its aim is to obtain, from a collection of text documents, those most
relevant to a pattern query. Current technology is mostly oriented to "natural language" text
collections, where inverted indices are the preferred solution. As successful as this paradigm has
been, it fails to properly handle some East Asian languages and other scenarios where the "natural
language" assumptions do not hold. In this survey we cover the recent research in extending the
document retrieval techniques to a broader class of sequence collections, which has applications
in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia
information retrieval, and many others. We focus on the algorithmic aspects of the techniques,
uncovering a rich world of relations between document retrieval challenges and fundamental
problems on trees, strings, range queries, discrete geometry, and others.

Categories and Subject Descriptors: E.1 [Data structures]; E.2 [Data storage representations];
E.4 [Coding and information theory]: Data compaction and compression; F.2.2 [Analysis
of algorithms and problem complexity]: Nonnumerical algorithms and problems - Pattern
matching, Computations on discrete structures, Sorting and searching; H.2.1 [Database
management]: Physical design - Access methods; H.3.2 [Information storage and retrieval]:
Information storage - File organization; H.3.3 [Information storage and retrieval]: Information
search and retrieval - Search process
1. INTRODUCTION



Retrieving useful information from huge masses of data is undoubtedly one of the
most important activities enabled by computers in the Information Age. Although
images and videos have gained much importance on the Internet, most of the search
activities, even on those media, rely on searching data in the form of sequences
(e.g., Google finds images based on the text content around them in Web pages).
The problem of finding the relevant information in large masses of text was already
pressing in the sixties, when the basis of modern Information Retrieval techniques
was laid out [Salton 1968]. Nowadays, needless to say, this is one of the most
important research topics in Computer Science [Croft et al. 2009; Büttcher et al.
2010; Baeza-Yates and Ribeiro-Neto 2011].



Interestingly, besides a large degree of added complexity to permanently improve
the search "quality" (i.e., how the returned information matches the need expressed
by the query), the core of the approach has not changed much since Salton's time.
One assumes that there is a collection of documents, each of which is a sequence of
words. This collection is indexed, that is, preprocessed in some form. This index is
able to answer queries, which are words, or sets of words, or sets of phrases (word
sequences). A relevance formula is used to establish how relevant each of the
documents is for the query. The task of the index is to return a set of documents
most relevant to the query, according to the formula.



In the original vector space model [Salton 1968], a set of distinguished words
(called terms) was extracted from the documents. A weight \(w(t,d)\) for term t in
document d was defined using the assumption that a term appearing many times
in a document was important in it. Thus a component of the weight was the term
frequency, \(\mathrm{tf}(t,d)\), which is the number of times t appears in d. A second component
aimed to downplay the role of terms that appeared in many documents (such as
articles and prepositions), as those do not really distinguish a document from others.
The so-called inverse document frequency was defined as \(\mathrm{idf}(t) = \lg(D/\mathrm{df}(t))\), where
D is the total number of documents and \(\mathrm{df}(t)\) is the number of documents where t
appears.¹ Then \(w(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)\) was the formula used in the famous "tf-idf"
model. The query was a set of terms, \(Q = \{q_1, q_2, \ldots, q_m\}\), and the relevance of a
document d for query Q was \(w(Q,d) = \sum_{i=1}^{m} w(q_i,d)\). Then the system returned
the top-k documents for Q, that is, k documents d with the highest \(w(Q,d)\) value.
When computers became more powerful, the so-called full-text model took every
word as a queryable term.
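
The tf-idf scoring just described can be sketched in a few lines. This is only an illustrative brute-force computation, not an index: the function names `tf_idf_scores` and `top_k` are our own, documents are assumed to be given as lists of terms, and ties are broken by document number, which the model itself does not prescribe.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Return w(Q, d) for every document d; docs is a list of term lists."""
    D = len(docs)
    tf = [Counter(doc) for doc in docs]      # tf(t, d) for every document
    df = Counter()                           # df(t): number of documents containing t
    for counts in tf:
        df.update(counts.keys())
    scores = []
    for counts in tf:
        score = 0.0
        for q in query:
            if df[q] > 0:                    # terms absent from the collection add nothing
                score += counts[q] * math.log2(D / df[q])
        scores.append(score)
    return scores

def top_k(docs, query, k):
    """Return the numbers (1-based) of k documents with highest w(Q, d)."""
    scores = tf_idf_scores(docs, query)
    ranked = sorted(range(1, len(docs) + 1), key=lambda d: (-scores[d - 1], d))
    return ranked[:k]

docs = [["mi", "ma", "ma"], ["la", "ma", "la"], ["me", "mi", "ma"], ["la", "me", "me"]]
print(tf_idf_scores(docs, ["la"]))
print(top_k(docs, ["ma"], 2))
```

Every query here scans the whole collection; the inverted index discussed next exists precisely to avoid that scan.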

As said, this simple model has been refined in recent years to an amazing
degree, including some features that are possible due to the social nature of the
Internet: the intrinsic value of the documents, the links between documents, the
fields where the words appear, the feedback and profile of the user, the behavior of
other users that made similar queries, and so on. Yet, the core of the idea is still
to find documents where the query terms appear many times.

The inverted index has always been the favorite structure to support these searches.
The essence of this structure could not be simpler: given the vocabulary of all the
queryable terms, the index stores a list of the documents d where each such term t
appears, plus information to compute its weight in each, \(w(t,d)\). Much research has
been carried out to efficiently store and access inverted indexes [Witten et al. 1999;
Büttcher et al. 2010; Baeza-Yates and Ribeiro-Neto 2011], without changing their
essential organization. All modern search engines use variants of inverted indexes.
Despite the immense success of this information retrieval model and implementation,
it has a clear limitation: it strongly relies on the fact that the vocabulary of all
the queryable terms has a manageable size. The empirical law proposed by Heaps
[1978] establishes that the vocabulary of a collection of size n grows like \(O(n^\beta)\) for
some constant \(0 < \beta < 1\), and it holds very accurately in many Western languages.
The model, therefore, restricts the queries to be whole words, not parts of words.
It is not even obvious how to deal with phrases. One could extend the concepts
of tf and idf to phrases and parts of words, but this would be quite difficult to
implement with an inverted index: one cannot store the list of documents where
every conceivable text substring appears!

¹We use lg to specify logarithm in base 2 (when it matters).

Such a limitation causes problems in highly synthetic languages such as Finnish,
Hungarian, Japanese, German, and many others, where long words are built from
particles. But it is more striking in languages where word separators are absent
from written text and can only be inferred by understanding its meaning: Chinese,
Korean, Thai, Japanese (Kanji), Lao, Vietnamese, and many others. Indeed,
"segmenting" those texts into words is considered a research problem belonging to the
area of Natural Language Processing (NLP); see, e.g., Rao and Xun [2012].



Instead of resorting to expensive and heuristic NLP techniques, a simple solution
for those cases is to treat the text as an uninterpreted sequence of symbols and
allow queries to find any substring in those sequences. Suffix trees [Weiner 1973;
McCreight 1976; Apostolico 1985] and suffix arrays [Gonnet et al. 1992; Manber and
Myers 1993], and their recent space-efficient versions [Navarro and Mäkinen 2007],
are data structures that efficiently solve the pattern matching problem, that is, they
point out all the positions in the sequences where a pattern appears. However, they
are not easily modified to handle document retrieval problems, such as listing the
documents, or just the most relevant documents, where the pattern appears.

Extending the document retrieval technology to efficiently handle collections of 
general sequences is not only interesting to enable classical Information Retrieval 
on those languages where the basic assumptions of inverted indexes do not hold. 
It also opens the door to using document retrieval techniques in a number of areas 
where similar queries are of interest: 

Bioinformatics: Searching and mining collections of DNA, gene, and amino acid
sequences is at the core of most Bioinformatic tasks. Genes can be regarded
as documents formed by sequences of base pairs (A, C, G, T), proteins can be
seen as documents formed by amino acid sequences (an alphabet of size 20),
and even genomes can be modeled as documents formed by gene sequences
(here each gene is identified with an integer number). Many searching and
mining problems are solved with suffix trees [Gusfield 1997], and some are best
recast into document retrieval problems. Some examples are listing the proteins
where a certain amino acid sequence appears, or the genes where a certain DNA
marker appears often, or the genomes where a certain set of genes appear, and
so on. Further, bioinformatic databases integrate not only sequence data but
also data on structure, function, metabolics, location, and other items that are
not always natural language. See, for example, Bartsch et al. [2011].

Software repositories: Handling a large software repository requires managing
a number of versions, packages, modules, routines, etc., which can be regarded
as documents formed by sequences in some formal language (such as a
programming or a specification language). In maintaining such repositories it is
natural to look for modules implementing some function, functions which use
some expression in their code, packages where some function is frequently used,
and also higher-level information mined from the raw data. Those are, again,
typical document retrieval queries. See, for example, Linstead et al. [2009].

Chemoinformatics: Databases storing sets of complex molecules where certain
compounds are sought are of much interest for pharmaceutical companies, to
aid in the process of drug design, for example. The typical technique is to
describe compounds by means of short strings that can then be searched. Here
the documents can be long molecules formed by many compounds, or sets of
related molecules. This is a recent area of research that has grown very fast in
relatively few years; see, for example, Brown [2005].



Symbolic music sequences: As an example of multimedia sequences, consider
collections of symbolic music (e.g., in MIDI format). One may wish to look for
pieces containing some sequence, pieces where some sequence appears often,
and so on. This is useful for many tasks, including music retrieval, music
analysis, authorship determination, plagiarism detection, and so on. See, for
example, Typke et al. [2005].



These applications display a wide range of document sizes, alphabets, and types
of queries (list documents where a pattern appears, or appears often enough, or
most often, or find the patterns occurring most often, etc.). Moreover, while exact
matching is adequate for software repositories, approximate searching should be
permitted on DNA, some octave invariance should be allowed on MIDI, and so on.

In this survey we focus on a basic scenario that has been challenging enough to
attract most of the research in this area, and that is general enough to be useful in a
large number of cases. We consider document listing and top-k document retrieval,
and occasionally some extension, of single-string patterns that are matched exactly
against sequence collections on arbitrary integer alphabets. In some cases we use
the term frequency as the relevance measure, whereas in other cases we cover more
general measures. In the Conclusions we discuss more complex scenarios.

Soon in the survey, the relation between the document retrieval problems we
consider and analogous problems on sequences of colors (or categories) becomes
apparent. Thus problems arise such as listing the different colors, counting the
different colors, or finding the k most frequent colors in a range of a sequence. Those
so-called color range queries are not only algorithmically interesting by themselves,
but have immediate applications in some further areas related to data mining:

Web mining: Websites collect information on how users access them, in some
cases for purposes like charging for the access, but in all cases those access logs
are invaluable tools to learn about user access patterns, or favorite contents,
and so on. Color range queries allow one, for example, to determine the number
of unique users that have accessed a site, the most frequently visited pages in
the site, the frequencies of different types of queries in a search engine, and so
on. See, for example, Liu [2007].



Database tuning: Monitoring the usage of high-performance database servers is
important to optimize their behavior and predict potential problems. Color
range queries are useful, for example, to analyze the number of open sessions
in a time period, the most frequent queries or most frequently accessed tables,
and so on. Shasha and Bonnet [2003] give a comprehensive overview.



Business intelligence: The analysis of customer behavior is at the heart of many
tasks related to business intelligence, which range from finding which products
and services to offer, to how to place products on shelves. In these cases one seeks
not only, say, the most frequently purchased categories of items, but also
more complex patterns such as sets of items frequently bought together. This
is called itemset mining and is more complex than simple color range queries,
yet these are important basic tools. See, for example, Han et al. [2007].

Social behavior: The analysis of words used in tweets, sites visited, topics queried,
"likes", and many other aspects of social behavior is instrumental to understand
social phenomena and exploit social networks. Queries like finding the most
frequent words used in a time period, the number of distinct posters in a blog,
the most visited pages in a time period, and so on, are natural color range
queries. See, for example, Silvestri [2010].



Bioinformatics again: Pattern discovery, such as finding frequent q-mers (strings
of length q) in areas of interest in genomes, plays an important role in
bioinformatics. For fixed q (which is the usual practice) one can see the genome as a
sequence of overlapping q-mers, and thus pattern discovery becomes a problem
of detecting frequent colors (q-mers) in a range of a sequence of colors (q-mers).
See, once again, Gusfield [1997].
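
To make the three basic color range queries concrete, the following sketch answers them by brute force on a sequence C with 1-based inclusive ranges. The function names are our own and the cost is linear in the range length per query, which is exactly what the data structures covered in this survey aim to avoid.

```python
from collections import Counter

def list_colors(C, i, j):
    """Distinct colors in C[i..j], in order of first appearance."""
    seen = {}
    for c in C[i - 1:j]:
        seen.setdefault(c, True)      # dicts keep insertion order
    return list(seen)

def count_colors(C, i, j):
    """Number of distinct colors in C[i..j]."""
    return len(set(C[i - 1:j]))

def top_k_colors(C, i, j, k):
    """k most frequent colors in C[i..j]; ties broken by color value (our choice)."""
    counts = Counter(C[i - 1:j])
    return sorted(counts, key=lambda c: (-counts[c], c))[:k]

C = ["a", "b", "a", "c", "b", "a"]
print(list_colors(C, 2, 5), count_colors(C, 2, 5), top_k_colors(C, 1, 6, 2))
```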



Unlike inverted indexes, which are algorithmically simple, the solutions for general
document retrieval (and color queries) have a rich algorithmic structure, with
many connections to fundamental problems on trees, strings, range queries, discrete
geometry, and others. The main goal of this survey is to emphasize the fascinating
algorithmic and data structuring aspects of the current document retrieval solutions.
Thus, although we show the best existing results, we focus on the important
algorithmic ideas, leaving the more technical details for further reading. Along the way,
we also fix some inaccuracies and even errors found in the literature, and propose
new solutions to some of those problems.

2. NOTATION AND BASIC CONCEPTS 

2.1 Notation on Strings 

A string \(S = S[1,n]\) is a sequence of characters, each of which is an element of a
set \(\Sigma\) called an alphabet. We will assume \(\Sigma = [1..\sigma] = \{1, 2, \ldots, \sigma\}\). The length
(number of characters) of \(S[1,n]\) is denoted \(|S| = n\). We denote by \(S[i]\) the i-th
character of S, and \(S[i,j] = S[i] \ldots S[j]\) a substring of S. When \(i > j\) it holds
\(S[i,j] = \varepsilon\), the empty string of length \(|\varepsilon| = 0\). A prefix of S is a substring of the
form \(S[1,j]\) and a suffix is of the form \(S[i,n]\). By \(SS'\) we denote the concatenation
of strings S and \(S'\), where the characters of \(S'\) are appended at the end of S. A
single character can stand for a string of length 1, thus \(cS\) and \(Sc\), for \(c \in \Sigma\), also
denote concatenations.

The lexicographical order "\(\prec\)" among strings is defined as follows. Let \(a, b \in \Sigma\)
and let S and \(S'\) be strings. Then \(aS \prec bS'\) if \(a < b\), or if \(a = b\) and \(S \prec S'\).
Furthermore, \(\varepsilon \prec S\) for any \(S \neq \varepsilon\).
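
The recursive definition of the lexicographical order translates directly into code. The sketch below assumes strings are given as Python sequences of integers; `lex_less` is our own name. On such inputs it agrees with Python's built-in comparison of tuples.

```python
def lex_less(S, Sp):
    """Return True if S precedes Sp in lexicographical order."""
    if len(Sp) == 0:
        return False                  # no string precedes the empty string
    if len(S) == 0:
        return True                   # the empty string precedes any nonempty string
    a, b = S[0], Sp[0]
    if a != b:
        return a < b                  # first characters decide
    return lex_less(S[1:], Sp[1:])    # equal first characters: compare the rest

print(lex_less([1, 2], [1, 3]), lex_less([1], [1]), lex_less([], [1]))
```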
2.2 Model and Formal Problem

We model the document retrieval problems to be considered in the following way. 

- There is a collection \(\mathcal{D}\) of D documents, \(\mathcal{D} = \{T_1, \ldots, T_D\}\).

- Each document \(T_d\) is a nonempty string over alphabet \(\Sigma = [1..\sigma]\).

- We define \(T = T_1\$T_2\$ \ldots T_D\$\) as a string over \(\Sigma \cup \{\$\}\), with \(\$ = 0 < c\) for any \(c \in \Sigma\),
which concatenates all the texts in \(\mathcal{D}\) using a separator symbol.

- The length of T is \(|T| = n\) and the length of each \(T_d\) is \(|T_d| = n_d\).

- Queries consist of a single pattern string \(P[1,m]\) over \(\Sigma\).

We define now the problems we consider. First we define the set of occurrence
positions of pattern P in a document \(T_d\).

Definition 1 (Occurrence Positions) Given a document string \(T_d\) and a pattern
string P, the occurrence positions (or just occurrences) of P in \(T_d\) are the set
\(\mathrm{occ}(P,T_d) = \{1 + |X|, \ \exists Y, \ T_d = XPY\}\).
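
As a quick illustration of Definition 1, the following brute-force sketch computes the occurrence positions by trying every starting point. It assumes documents and patterns are Python lists of symbols; the name `occ` mirrors the definition, but the code is only a quadratic-time reference, not an index.

```python
def occ(P, Td):
    """Occurrence positions of P in Td, 1-based: 1 + |X| for every split Td = X P Y."""
    m = len(P)
    return {i + 1 for i in range(len(Td) - m + 1) if Td[i:i + m] == P}

print(occ(["ma"], ["mi", "ma", "ma"]))
```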

Now we define the document retrieval problems we consider. We start with the 
simplest one. 

Problem 1 (Document Listing) Preprocess a document collection \(\mathcal{D}\) so that,
given a pattern string P, one can compute \(\mathrm{list}(\mathcal{D},P) = \{d, \ \mathrm{occ}(P,T_d) \neq \emptyset\}\), that is,
the documents where P appears. We call \(\mathit{docc} = |\mathrm{list}(\mathcal{D},P)|\) the size of the output.

Variants of the document listing problem, which we will occasionally consider, 
include computing the term frequency for each reported document, and computing 
the document frequency of P. Those functions are typically used in relevance 
formulas (recall measures tf and idf in the Introduction). 

Definition 2 (Document and Term Frequency) The document frequency of
P in a document collection \(\mathcal{D}\) is defined as \(\mathrm{df}(P) = \mathit{docc} = |\mathrm{list}(\mathcal{D},P)|\), that is, the
number of documents where P appears. The term frequency of P in document d is
defined as \(\mathrm{tf}(P,d) = |\mathrm{occ}(P,T_d)|\), that is, the number of times P appears in \(T_d\).
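
Document listing, document frequency, and term frequency can all be phrased on top of the occurrence sets. The sketch below is a brute-force reference that scans every document, under the assumption that the collection is a Python list of symbol lists numbered from 1; the names `doc_list`, `df`, and `tf` are ours.

```python
def occ(P, Td):
    """Occurrence positions (1-based) of P in Td, by brute force."""
    m = len(P)
    return {i + 1 for i in range(len(Td) - m + 1) if Td[i:i + m] == P}

def doc_list(docs, P):
    """Numbers of the documents where P appears."""
    return {d for d, Td in enumerate(docs, start=1) if occ(P, Td)}

def df(docs, P):
    """Document frequency of P in the collection."""
    return len(doc_list(docs, P))

def tf(docs, P, d):
    """Term frequency of P in document number d."""
    return len(occ(P, docs[d - 1]))

docs = [["mi", "ma", "ma"], ["la", "ma", "la"], ["me", "mi", "ma"], ["la", "me", "me"]]
print(doc_list(docs, ["ma"]), df(docs, ["ma"]), tf(docs, ["ma"], 1))
```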

Our second problem relates to ranked retrieval, that is, reporting only some 
important documents instead of all those where P appears. 

Problem 2 (Top-k [Most Frequent] Documents) Preprocess a document collection
\(\mathcal{D}\) so that, given a pattern string P and a threshold k, one can compute
\(\mathrm{top}(\mathcal{D},P,k) \subseteq \mathrm{list}(\mathcal{D},P)\) such that \(|\mathrm{top}(\mathcal{D},P,k)| = \min(k, \mathrm{df}(P))\) and, for any
\(d \in \mathrm{top}(\mathcal{D},P,k)\) and \(d' \in \mathrm{list}(\mathcal{D},P) \setminus \mathrm{top}(\mathcal{D},P,k)\), it holds \(|\mathrm{occ}(P,T_d)| \ge |\mathrm{occ}(P,T_{d'})|\).
That is, find k documents where P appears the most times. This latter condition
can be generalized to any other function of \(\mathrm{occ}(P,T_d)\) and \(\mathrm{occ}(P,T_{d'})\).
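
A brute-force reading of Problem 2 is sketched next. It assumes the collection is a list of symbol lists numbered from 1, and it breaks ties by smaller document number, a choice the problem statement leaves open; `top_k_frequent` and `count_occ` are our own names.

```python
def count_occ(P, Td):
    """Number of occurrences of P in Td, by brute force."""
    m = len(P)
    return sum(1 for i in range(len(Td) - m + 1) if Td[i:i + m] == P)

def top_k_frequent(docs, P, k):
    """Return min(k, df(P)) documents where P appears the most times."""
    freq = {d: count_occ(P, Td) for d, Td in enumerate(docs, start=1)}
    listed = [d for d in freq if freq[d] > 0]            # only documents where P appears
    return sorted(listed, key=lambda d: (-freq[d], d))[:k]

docs = [["mi", "ma", "ma"], ["la", "ma", "la"], ["me", "mi", "ma"], ["la", "me", "me"]]
print(top_k_frequent(docs, ["ma"], 2))
```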

A simpler variant of this problem arises when the importance of the documents 
is fixed and independent of the search pattern (as in Google's PageRank) . 
Problem 3 (Top-k Most Important Documents) Preprocess a document collection
\(\mathcal{D}\) so that, given a pattern string P and a threshold k, one can compute
\(\mathrm{top}(\mathcal{D},P,k) \subseteq \mathrm{list}(\mathcal{D},P)\) such that \(|\mathrm{top}(\mathcal{D},P,k)| = \min(k, \mathrm{df}(P))\) and, for any
\(d \in \mathrm{top}(\mathcal{D},P,k)\) and \(d' \in \mathrm{list}(\mathcal{D},P) \setminus \mathrm{top}(\mathcal{D},P,k)\), it holds \(W(d) \ge W(d')\), where W
is a fixed weight function assigned to the documents.
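
For Problem 3 only membership in the list matters, since the ranking comes from the fixed weights. In the following brute-force sketch the weights are passed as a dictionary W from document number to weight; both the weight values in the example and the name `top_k_important` are hypothetical, chosen only for illustration.

```python
def appears(P, Td):
    """True if P occurs in Td (brute force)."""
    m = len(P)
    return any(Td[i:i + m] == P for i in range(len(Td) - m + 1))

def top_k_important(docs, W, P, k):
    """Return min(k, df(P)) documents containing P with the highest weight W[d]."""
    listed = [d for d, Td in enumerate(docs, start=1) if appears(P, Td)]
    return sorted(listed, key=lambda d: (-W[d], d))[:k]   # ties by document number

docs = [["mi", "ma", "ma"], ["la", "ma", "la"], ["me", "mi", "ma"], ["la", "me", "me"]]
W = {1: 0.1, 2: 0.9, 3: 0.5, 4: 0.7}                      # made-up static weights
print(top_k_important(docs, W, ["ma"], 2))
```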

2.3 Some Fundamental Problems and Data Structures 

Before entering into the main part of the survey, we cover here a few fundamental 
problems and existing solutions to them. Understanding the problem definitions 
and the complexities of the solutions is sufficient to follow the survey. Still, we 
give pointers to further reading for the interested readers. Rather than giving early
isolated illustrations of these data structures, we will exemplify them later, when
they are used in the document retrieval structures.

2.3.1 Some Compact Data Structures. Many document retrieval solutions require
too much space in their simplest form, and thus compressed representations are
used to reduce their space to a manageable level. We enumerate some basic
problems that arise and the compact data structures to handle them. Most of
these are covered in detail in a previous survey [Navarro and Mäkinen 2007], so
we only list the results here. All the compact data structures we will use, and
the document retrieval solutions we build on them, assume the RAM model of
computation, where the computer manages in constant time words of size \(O(\lg n)\),
as it must be possible to address an array of n elements. The typical arithmetic
and bit manipulation operations can be carried out on words in constant time.

A basic problem is to store a sequence over an integer alphabet so that any 
sequence position can be accessed and also two complementary operations called 
rank and select can be carried out on it. 

Problem 4 (Rank/Select/Access on Sequences) Represent a sequence \(C[1,n]\)
over alphabet \([1,D]\) so that one can answer three queries on it: (1) accessing any
\(C[i]\); (2) computing \(\mathrm{rank}_c(C,i)\), the number of times symbol \(c \in [1,D]\) occurs in
\(C[1,i]\); (3) computing \(\mathrm{select}_c(C,j)\), the position of the j-th occurrence of symbol c
in C. It is assumed that \(\mathrm{rank}_c(C,0) = \mathrm{select}_c(C,0) = 0\).
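
The semantics of the three queries can be pinned down with a plain linear-scan reference. The class below is only meant to fix the conventions (1-based positions, rank and select returning 0 for argument 0); the name `NaiveSequence` and the choice of returning None when fewer than j occurrences exist are our assumptions, and nothing here reflects the space or time bounds of the solutions that follow.

```python
class NaiveSequence:
    """Linear-time reference for access, rank and select on a sequence C[1..n]."""

    def __init__(self, C):
        self.C = list(C)

    def access(self, i):
        return self.C[i - 1]

    def rank(self, c, i):
        """Number of occurrences of c in C[1..i]; rank(c, 0) = 0."""
        return sum(1 for x in self.C[:i] if x == c)

    def select(self, c, j):
        """Position of the j-th occurrence of c; select(c, 0) = 0."""
        if j == 0:
            return 0
        seen = 0
        for pos, x in enumerate(self.C, start=1):
            if x == c:
                seen += 1
                if seen == j:
                    return pos
        return None                    # fewer than j occurrences (our convention)

seq = NaiveSequence([2, 1, 2, 3, 2])
print(seq.access(4), seq.rank(2, 3), seq.select(2, 3))
```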

A basic case arises when the sequence is a bitmap B over alphabet {0, 1}. Then 
the problem can be solved in constant time and using sublinear extra space. 



Solution 1 (Rank/Select/Access on Bitmaps) [Jacobson 1989; Munro 1996;
Clark 1996] By storing \(o(n)\) bits on top of \(B[1,n]\) one can solve the three queries in
constant time.



There exist also solutions suitable for the case where B contains few 1s or few
0s. From the various solutions, the following one is suitable for this survey. Note
that access queries can be solved using \(B[i] = \mathrm{rank}_1(B,i) - \mathrm{rank}_1(B,i-1)\).
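
The flavor of rank structures on bitmaps, and the identity that recovers access from rank, can be seen in the following simplified sketch. It stores one cumulative counter per block and scans inside the block, so it is neither the \(o(n)\)-bit constant-time structure of Solution 1 nor a compressed representation; the class name `SampledBitmap`, the block size, and the binary-search select are our own simplifications.

```python
class SampledBitmap:
    """Bitmap with one sampled rank counter every `block` bits (illustrative only)."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        self.samples = [0]             # samples[q] = number of 1s in B[1..q*block]
        for start in range(0, len(self.bits), block):
            self.samples.append(self.samples[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        q, r = divmod(i, self.block)
        return self.samples[q] + sum(self.bits[q * self.block:q * self.block + r])

    def access(self, i):
        """B[i], recovered from two rank queries."""
        return self.rank1(i) - self.rank1(i - 1)

    def select1(self, j):
        """Position of the j-th 1, by binary search on rank1; None if it does not exist."""
        if j == 0:
            return 0
        if j > self.rank1(len(self.bits)):
            return None
        lo, hi = 1, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo

B = SampledBitmap([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1], block=4)
print(B.rank1(5), B.access(3), B.select1(5))
```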



Solution 2 (Rank/Select/Access on Bitmaps) [Raman et al. 2007] A bitmap
\(B[1,n]\) with m 1s (or m 0s) can be stored in \(m \lg \frac{n}{m} + O(m) + o(n)\) bits so that the
three queries can be solved in constant time.
A weaker version of this representation can only compute \(\mathrm{rank}_1(B,i)\) in those
positions where \(B[i] = 1\), and it cannot determine whether this is the case. This is
called a monotone minimum perfect hash function (mmphf) and can be stored in
less than the space required for compressed bitmaps.



Solution 3 (Mmphfs on Bitmaps) [Belazzougui et al. 2009] A bitmap \(B[1,n]\)
with m 1s can be stored in \(O(m \lg\lg \frac{n}{m})\) bits so that \(\mathrm{rank}_1(B,i)\), if \(B[i] = 1\), is
computed in \(O(1)\) time. If \(B[i] = 0\) the query returns an arbitrary value. Alternatively,
the bitmap can be stored in \(O(m \lg\lg\lg \frac{n}{m})\) bits and the query time is \(O(\lg\lg \frac{n}{m})\).

There are also various efficient solutions for general sequences. One uses wavelet
trees [Grossi et al. 2003; Navarro 2012], which we will describe in detail later in
the survey due to their many applications in document retrieval. When we only
need to solve Problem 4 and the sequence does not offer relevant compression
opportunities, as will be the case in this survey, the following result is sufficient
(although the results can be slightly improved [Belazzougui and Navarro 2012]).



Solution 4 (Rank/Select/Access on Sequences) [Golynski et al. 2006; Grossi
et al. 2010] A sequence \(C[1,n]\) over alphabet \([1,D]\) can be stored in \(n \lg D + o(n \lg D)\)
bits so that query rank can be solved in time \(O(\lg\lg D)\) and, either \(C[i]\) can be
accessed in \(O(1)\) time and query select can be solved in \(O(\lg\lg D)\) time, or vice versa.

Finally, we will make use of compressed tree representations. These represent a 
general tree of n nodes so that navigation operations are carried out in constant 
time, using as little space as possible to describe its topology. From the many tree 
representations in the literature, the following one is convenient for this survey. 



Solution 5 (Tree Representation) [Sadakane and Navarro 2010] A general tree
of n nodes can be represented using \(2n + o(n)\) bits so that a large number of
navigation operations on the tree can be carried out in constant time.

2.3.2 Range Minimum Queries (RMQs) and Lowest Common Ancestors (LCAs). 
Many document retrieval solutions make heavy use of the following problem on 
arrays of integers. 

Problem 5 (Range Minimum Query, RMQ) Preprocess an array \(L[1,n]\) of
integers so that, given a range \([sp,ep]\), we can output the position of a minimum
value in \(L[sp,ep]\), \(\mathrm{RMQ}_L(sp,ep) = \arg\min_{sp \le p \le ep} L[p]\).
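
As a point of reference for the RMQ problem, the classical sparse-table technique answers queries in constant time after \(O(n \lg n)\) time and space preprocessing. We sketch it below because it is short; it is not one of the linear-space or succinct solutions summarized in this section, and the class name `SparseTableRMQ` is ours. Positions are 1-based as in the problem statement.

```python
class SparseTableRMQ:
    """RMQ by a sparse table: table[j][i] is the argmin of L over 2^j cells starting at i."""

    def __init__(self, L):
        self.L = list(L)
        n = len(self.L)
        self.table = [list(range(n))]              # windows of length 1
        j = 1
        while (1 << j) <= n:
            prev, half = self.table[-1], 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                a, b = prev[i], prev[i + half]     # argmins of the two halves
                row.append(a if self.L[a] <= self.L[b] else b)
            self.table.append(row)
            j += 1

    def query(self, sp, ep):
        """1-based position of a minimum of L[sp..ep]."""
        lo, hi = sp - 1, ep - 1
        j = (hi - lo + 1).bit_length() - 1         # largest power of two fitting the range
        a = self.table[j][lo]                      # window starting at lo
        b = self.table[j][hi - (1 << j) + 1]       # window ending at hi
        return (a if self.L[a] <= self.L[b] else b) + 1

rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6])
print(rmq.query(1, 8), rmq.query(5, 8))
```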

The RMQ problem has a rich history, which we partially cover in Appendix A.
An interesting data structure related to it is the Cartesian tree [Vuillemin 1980].



Definition 3 (Cartesian Tree) The Cartesian tree of an array \(L[1,n]\) is a binary
tree whose root corresponds to the position p of the minimum in \(L[1,n]\), and the
left and right children are, recursively, Cartesian trees of \(L[1,p-1]\) and \(L[p+1,n]\),
respectively. The Cartesian tree of an empty array interval is a null pointer.
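
A Cartesian tree can be built in linear time with a stack that holds the rightmost path of the tree built so far. The sketch below uses this well-known construction, represents the tree with 1-based `left` and `right` arrays where 0 plays the role of the null pointer, and resolves ties in favor of the leftmost minimum; these representation choices and the name `cartesian_tree` are ours, not part of the definition.

```python
def cartesian_tree(L):
    """Return (root, left, right) of the Cartesian tree of L[1..n]; 0 means null."""
    n = len(L)
    left = [0] * (n + 1)
    right = [0] * (n + 1)
    stack = []                                  # rightmost path, values nondecreasing
    for p in range(1, n + 1):
        last = 0
        while stack and L[stack[-1] - 1] > L[p - 1]:
            last = stack.pop()                  # larger values end up below p
        left[p] = last                          # last popped subtree hangs to the left
        if stack:
            right[stack[-1]] = p                # p becomes the new rightmost child
        stack.append(p)
    root = stack[0] if stack else 0
    return root, left, right

print(cartesian_tree([3, 1, 4, 1, 5]))
```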

Cartesian trees are instrumental in relating RMQs with the following problem. 
Problem 6 (Lowest Common Ancestor, LCA) Preprocess a tree so that, given
two nodes u and v, we can output the deepest tree node that is an ancestor of both
u and v.
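
The connection between the two problems is that the LCA of the Cartesian tree nodes for positions i and j is the position of a minimum in \(L[i,j]\). The following sketch checks this on a small array: it builds the tree directly from the recursive definition (choosing the leftmost minimum on ties) and computes LCAs by climbing parent pointers, which takes time proportional to the depth rather than constant time. The function names are ours.

```python
def cartesian_parents(L):
    """Parent and depth arrays (1-based, 0 = no parent) of the Cartesian tree of L."""
    n = len(L)
    parent = [0] * (n + 1)
    depth = [0] * (n + 1)

    def build(lo, hi, par, dep):
        if lo > hi:
            return
        p = min(range(lo, hi + 1), key=lambda q: L[q - 1])   # leftmost minimum
        parent[p], depth[p] = par, dep
        build(lo, p - 1, p, dep + 1)
        build(p + 1, hi, p, dep + 1)

    build(1, n, 0, 0)
    return parent, depth

def lca(parent, depth, u, v):
    """Deepest common ancestor of u and v, by climbing (not constant time)."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

L = [3, 1, 4, 1, 5, 9, 2, 6]
parent, depth = cartesian_parents(L)
print(lca(parent, depth, 5, 8))      # position of the minimum of L[5..8]
```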

The main results we need on RMQs are summarized in the following two solutions.
The first is a classical result stating that the problem can be solved in linear
space and optimal time.



Solution 6 (RMQ) [Harel and Tarjan 1984; Schieber and Vishkin 1988; Berkman
and Vishkin 1993; Bender and Farach-Colton 2000] The problem can be solved using
linear space and preprocessing time, and constant query time.

The second result shows that, by storing just \(O(n)\) bits from the original array
L, we can solve RMQs without accessing L at query time. This is relevant for the
compressed solutions.



Solution 7 (RMQ) [Fischer and Heun 2011] The problem can be solved using
\(2n + o(n)\) bits, linear preprocessing time, and constant query time, without accessing
the original array at query time. This space is asymptotically optimal.

Similarly, the related LCA problem can be solved in constant time using linear
space, and even on a tree representation that uses \(2n + o(n)\) bits for a tree of n
nodes (see Solution 5).

3. OCCURRENCE RETRIEVAL INDEXES 

In this section we cover indexes that handle collections of general sequences, but 
that address the more traditional problem of finding or counting all the occurrences 
of a pattern P in a text T (i.e., computing occ(P,T) or |occ(P,T)|). We focus on 
those upon which document retrieval indexes are built: suffix trees, suffix arrays, 
and compressed suffix arrays. 

3.1 Generalized Suffix Trees

Consider a text \(T[1,n] = T_1\$T_2\$ \ldots T_D\$\) over alphabet \(\Sigma \cup \{\$\}\). Now consider the
n suffixes of the form \(\mathcal{S} = \{T_d[i,n_d]\$, \ 1 \le d \le D, \ 1 \le i \le n_d + 1\}\). The Generalized
Suffix Tree (GST) of T is a data structure storing those n strings in \(\mathcal{S}\).²

To describe the GST, we start with a tree where the edges are labeled with
symbols in \(\Sigma \cup \{\$\}\), and where each string in \(\mathcal{S}\) can be read by concatenating the
labels from the root to a leaf. No two edges leaving a node have the same label, and
they are ordered left to right according to those labels. The string label of a node
is the concatenation of the characters labeling the edges from the root to the node.
Thus, each string label in the tree is a unique prefix in \(\mathcal{S}\), and there is exactly one
tree leaf per string in \(\mathcal{S}\).
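
The tree described so far, before any compaction, can be sketched with nested dictionaries. The code assumes T is a Python list of symbols that ends with the separator "$", cuts every suffix at the "$" of its own document, and stores the starting positions under the key None at each leaf. Since all the separators are the same symbol here, a leaf may collect several positions. The names `build_suffix_trie` and `find` are ours, and the children are not kept in label order as the text requires.

```python
def build_suffix_trie(T):
    """Trie (nested dicts) of the suffixes of T, each cut at its document's "$"."""
    root = {}
    for p in range(len(T)):
        end = T.index("$", p)                    # first separator at or after position p
        node = root
        for sym in T[p:end + 1]:
            node = node.setdefault(sym, {})
        node.setdefault(None, []).append(p + 1)  # leaf: 1-based starting position in T
    return root

def leaves_below(node):
    """All starting positions stored at the leaves below a node."""
    out = list(node.get(None, []))
    for sym, child in node.items():
        if sym is not None:
            out.extend(leaves_below(child))
    return out

def find(root, P):
    """Occurrence positions of P in T, sorted; empty list if P does not occur."""
    node = root
    for sym in P:
        if sym not in node:
            return []
        node = node[sym]
    return sorted(leaves_below(node))

T = "mi ma ma $ la ma la $ me mi ma $ la me me $".split()
trie = build_suffix_trie(T)
print(find(trie, ["mi", "ma"]))
```

This uncompacted tree can have a number of nodes quadratic in n, which is why the steps that follow remove unary nodes and shorten the edge labels.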

To obtain a GST from this tree we carry out three steps: (1) remove all nodes
with just one child, appending its label to that of its parent (now edges will be
labeled with strings); (2) attach to leaves the starting position of their suffix in T;
(3) retain only the first character and the length of the strings labeling the edges,
that is, labels will be of the form \((c,\ell)\), with \(c \in \Sigma \cup \{\$\}\) and \(\ell > 0\).

²For technical reasons each "$" symbol should be different, but this is not done in practice. We
prefer to ignore this issue for simplicity.

Example. We introduce our running example text collection. To combine readability
and manageability, we consider an alphabet of syllables on a hypothetical
language.³ The alphabet of our texts will be \(\Sigma = \{\mathrm{la}, \mathrm{ma}, \mathrm{me}, \mathrm{mi}\}\). Our document
collection \(\mathcal{D} = \{T_1, T_2, T_3, T_4\}\) will have \(D = 4\) texts, \(T_1\) = "mi ma ma", \(T_2\) = "la ma la",
\(T_3\) = "me mi ma", \(T_4\) = "la me me". Their lengths are \(n_1 = n_2 = n_3 = n_4 = 3\).
They are concatenated into a single text

T = "mi ma ma $ la ma la $ me mi ma $ la me me $"

of length \(n = 16\). Fig. 1 shows the individual suffix trees of the texts, plus the GST
of T (which we also call the GST of \(\mathcal{D}\)).

Note that, because we do not use a distinct "$" terminator per document, some 
anomalies arise in our example, with leaves corresponding to several symbols. As 
explained, those do not cause any problem in practice. 

As mentioned, suffix trees (which are GSTs of only one text \(T_1\$\)) and GSTs
are used for many complex tasks [Apostolico 1985; Gusfield 1997; Crochemore
and Rytter 2002], yet in this article we will only describe the simplest one, as an
occurrence retrieval index.

Consider a pattern \(P[1,m]\). We start reading its characters in sequence, from
1 to m. We first find the root child labeled \((P[1],\ell)\). If it does not exist, then P
does not appear anywhere in T. If it exists, then we move to that child and look
for the rest of the pattern, \(P[1+\ell,m]\), in the same way. Apart from determining
that P does not occur in T, the process may end in two forms: (1) we arrive at
a GST node v and have consumed all the characters in P,⁴ or (2) we arrive at a
GST leaf without having consumed all of P. Case (1) means that P may occur in
T, as we have not matched the skipped characters, but all of the leaves descending
from v share the same prefix up to \(\mathrm{str}(v)\). Thus we pick any leaf descending from
v (v may be itself a leaf), take its position p in T, and directly compare \(P[1,m]\)
with \(T[p,p+m-1]\). If they do not match, then P does not occur in T. If they
do, then v is said to be the locus of P: each leaf descending from v has attached
the position of an occurrence of P in T. Case (2) is a border condition where the
suffix is shorter than P, and implies that P does not occur in T.

Example. Searching for P = "mi me ma" in the GST will lead to the rightmost
leaf, pointing to position 1 in T, but a check against \(T[1,3]\) will reveal that the
skipped symbol does not match. Instead, a search for P = "mi ma" will end up in
the rightmost child v of the root, and a comparison with some leaf, say \(T[10,11]\),
will show that P does match with \(\mathrm{str}(v)\). Hence \(\mathrm{occ}(P,T) = \{10, 1\}\).
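
The search procedure can be traced with a small sketch of the path-compressed tree, where each edge keeps only its first symbol and its length. The construction below is a simple recursive one, not a linear-time algorithm; it assumes T is a list of symbols ending with "$" and that P contains no "$". The node layout (dictionaries with "leaf" and "children" fields) and the function names are our own, and the children are kept in insertion order rather than sorted by label.

```python
def build_gst(T):
    """Path-compressed trie of the suffixes of T, each cut at its document's "$"."""
    suffixes = []
    for p in range(len(T)):
        end = T.index("$", p)
        suffixes.append((tuple(T[p:end + 1]), p + 1))
    return _build(suffixes, 0)

def _build(sufs, depth):
    # All strings in sufs share their first `depth` symbols.
    if all(len(s) == depth for s, _ in sufs):            # every string ends here: a leaf
        return {"leaf": sorted(p for _, p in sufs), "children": {}}
    groups = {}
    for s, p in sufs:
        groups.setdefault(s[depth], []).append((s, p))
    children = {}
    for c, grp in groups.items():
        length = 1                                       # grow the edge while the group agrees
        while (all(len(s) > depth + length for s, _ in grp)
               and len({s[depth + length] for s, _ in grp}) == 1):
            length += 1
        children[c] = (length, _build(grp, depth + length))   # edge label (c, length)
    return {"leaf": None, "children": children}

def leaves(node):
    if node["leaf"] is not None:
        return list(node["leaf"])
    out = []
    for _, child in node["children"].values():
        out.extend(leaves(child))
    return out

def search(gst, T, P):
    """Sorted occurrence positions (1-based) of P in T."""
    node, i, m = gst, 0, len(P)
    while i < m:
        if node["leaf"] is not None:                     # case (2): leaf reached too early
            return []
        edge = node["children"].get(P[i])
        if edge is None:
            return []
        length, node = edge
        i += length                                      # only P[i] was compared; rest skipped
    found = leaves(node)
    p = found[0]                                         # case (1): verify with any leaf
    if list(T[p - 1:p - 1 + m]) != list(P):
        return []
    return sorted(found)

T = "mi ma ma $ la ma la $ me mi ma $ la me me $".split()
gst = build_gst(T)
print(search(gst, T, ["mi", "ma"]), search(gst, T, ["mi", "me", "ma"]))
```

On the running example the first search reports positions 1 and 10, and the second one reaches a leaf but fails the final verification, as in the example above.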

With the above procedure we can compute \(\mathrm{occ}(P,T)\) in \(O(m + |\mathrm{occ}(P,T)|)\) time,
and \(|\mathrm{occ}(P,T)|\) in \(O(m)\) time, on integer alphabets.⁵ For the latter we must record
in each node v the number of leaves in its subtree.

³Suspiciously close to Spanish.

⁴It might be that the edge towards v is longer than what remains to be read in P. In this case
we still follow the edge.

⁵If \(\sigma\) is not taken as a constant we require perfect hashing to obtain \(O(m)\) time and linear space
for the structure; otherwise \(O(m \lg \sigma)\) time is achieved with binary search on the children.

      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
T  =  mi ma ma $  la ma la $  me mi ma $  la me me $

Fig. 1. The individual suffix tree of each document and the GST of the concatenated text T, in
our running example text collection. For legibility we omit the edge length when it is 1.

A formal succinct definition of GSTs, plus a couple of key concepts, follows. 

Definition 4 (Generalized Suffix Tree, GST) The generalized suffix tree of a
text collection 𝒟 = {T_1, T_2, ..., T_D} is a path-compressed trie storing all the suffixes
of T = T_1 $ T_2 $ ... T_D $, where the $ is a special character. The string label str(v)
of a node v is the concatenation of the string labels of the edges from the root to v.
The locus of a pattern P is the highest node v such that P is a prefix of str(v).

Note that the search times are independent of the length of the text T, which
is a remarkable property of suffix trees. Other good properties are that it takes
linear space (i.e., O(n lg n) bits) since it has n leaves and no unary nodes, and that
it can be built in linear time for constant alphabets [Weiner 1973; McCreight 1976;
Ukkonen 1995], and also on integer alphabets [Farach 1997; Kärkkäinen et al. 2006].

Fig. 2. The suffix array of text T on our running example.

The GST is a useful tool to group the O(n^2) possible substrings of T (and hence
possible search patterns) into O(n) nodes, where each node represents a group of
substrings that share the same occurrence positions in T. This allows one to store,
in linear space, information that is useful for document retrieval. For example,
one can associate to each GST node v the number of distinct documents where
str(v) appears, df(str(v)), which allows us to solve in O(m) time and linear space
the problem of computing docc. This ability has been exploited several times for
document retrieval.

3.2 Suffix Arrays 

The suffix array [Manber and Myers 1993; Gonnet et al. 1992] of a text T is a
permutation of the (starting positions of) suffixes of T, so that the suffixes are
lexicographically sorted. Alternatively, the suffix array of T is the sequence of
positions attached to the leaves of the suffix tree, read left to right.



Definition 5 (Suffix Array) The suffix array of a collection 𝒟 = {T_1, T_2, ..., T_D}
is an array A[1, n] containing a permutation of [1..n], such that T[A[i], n] ≺ T[A[i + 1], n]
for all 1 ≤ i < n, where T[1, n] = T_1 $ T_2 $ ... T_D $ and ≺ denotes lexicographic order.

Suffix arrays also take linear space and can be built in O(n) time, without the
need of building the suffix tree first [Kim et al. 2005; Ko and Aluru 2005; Kärkkäinen
et al. 2006].

Example. Fig. 2 illustrates the suffix array for our example.

An important property of suffix arrays is that each subtree of the suffix tree
corresponds to an interval of the suffix array, namely the one containing its leaves.
In particular, having A and T, one can just binary search the suffix array interval
corresponding to the occurrences of a pattern P[1, m], in O(m lg n) time (that is,
O(lg n) comparisons of m symbols). Another way to see this is that, since suffixes
are sorted in A, all those starting with P form a contiguous range. Once we
determine that all the occurrences of P are listed in A[sp, ep], we have |occ(P, T)| =
ep - sp + 1 and occ(P, T) = {A[sp], A[sp + 1], ..., A[ep]}.
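The binary search just described can be sketched in a few lines. This is only an illustration on the running example, not an efficient construction: the suffix array is built by naive sorting, the helper names (`suffix_key`, `prefix`, `sa_interval`, `occ`) are hypothetical, and the tie rule between suffixes that agree up to their terminator (break ties by text position) is an assumption of the sketch.

```python
# Running example: every word is one symbol and "$" ends each document.
T = "mi ma ma $ la ma la $ me mi ma $ la me me $".split()
n = len(T)


def suffix_key(p):
    """Sort key for the suffix starting at 1-based position p.

    Assumption of this sketch: a suffix is compared only up to its first
    "$", and ties are broken by text position.
    """
    end = T.index("$", p - 1) + 1
    return (T[p - 1:end], p)


# A[0] is padding so that A[1..n] follows the 1-based notation of the text.
A = [0] + sorted(range(1, n + 1), key=suffix_key)


def prefix(i, m):
    """First m symbols of the suffix that starts at text position A[i]."""
    return T[A[i] - 1:A[i] - 1 + m]


def sa_interval(P):
    """Return (sp, ep) with A[sp..ep] listing the suffixes that start with P.

    P is a list of symbols without "$". An empty interval has sp = ep + 1.
    """
    m = len(P)
    lo, hi = 1, n + 1
    while lo < hi:                      # leftmost cell whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid, m) < P:
            lo = mid + 1
        else:
            hi = mid
    sp, hi = lo, n + 1
    while lo < hi:                      # leftmost cell whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid, m) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo - 1


def occ(P):
    """Occurrence positions of P in T, in suffix array order."""
    sp, ep = sa_interval(P)
    return [A[i] for i in range(sp, ep + 1)]


print(sa_interval("mi ma".split()), occ("mi ma".split()))
```

Each of the O(lg n) probes compares up to m symbols, which matches the O(m lg n) bound stated above.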



Footnote: By storing more data, this can be reduced to O(m + lg n) time [Manber and Myers 1993].

Reference                        Space in bits                     t_search(m)                  t_SA
[Grossi et al. 2003]             (1 + 1/ε) n H_k(T) + O(n)         O(m lg σ + lg^4 n)           O(lg^ε n + lg σ)
[Grossi et al. 2003]             n H_k(T) lg lg_σ n + O(n)         O(m lg σ + lg^4 n)           O(lg lg_σ n + lg σ)
[Grossi et al. 2003]             n H_k(T) + o(n lg σ)              O(m lg σ + lg^4 n)           O(lg n lg lg σ)
[Sadakane 2003]                  (1/ε) n H_0(T) + O(n lg H_0(T))   O(m lg n)                    O(lg^ε n)
[Ferragina et al. 2007]          n H_k(T) + o(n lg σ)              O(m (1 + lg σ / lg lg n))    O(lg n (1 + lg_σ lg n))
[Barbay et al. 2011]             n H_k(T) + o(n lg σ)              O(m lg lg σ)                 O(lg_σ n (lg lg σ)^2)
[Barbay et al. 2010]             n H_k(T)(1 + o(1)) + o(n)         O(m (1 + lg σ / lg lg n))    O(lg n (lg σ + lg lg n))
[Barbay et al. 2010]             n H_k(T)(1 + o(1)) + o(n)         O(m lg lg σ)                 O(lg n (lg lg σ)^2)
[Belazzougui and Navarro 2011]   n H_k(T)(1 + o(1)) + O(n)         O(m)                         O(lg n)

Table 1. Space and time performance of some of the best CSAs to date. The spaces hold for
any k ≤ α lg_σ(n) - 1 and constant 0 < α < 1. The access time is obtained by choosing a suitable
sampling step in various CSAs.

3.3 Compressed Suffix Arrays

A compressed suffix array (CSA) over a text collection 𝒟 is a data structure that
emulates a suffix array on T within less space, usually providing even richer functionality.
At most, a CSA must use O(n lg σ) bits of space, that is, proportional to
the size of the text stored in plain form (as opposed to the O(n lg n) bits of classical
suffix arrays). There are, however, several CSAs using as little as n H_k(T) + o(n lg σ)
bits, where H_k(T) ≤ lg σ is the k-th order empirical entropy of T. This is a lower
bound to the bits-per-symbol performance achievable on T by any statistical compressor
that encodes each symbol according to the k symbols that precede it in the
text [Manzini 2001]. In practice, n H_k(T) is the least space a statistical encoder
can achieve on T.

CSAs are well covered in a relatively recent survey [Navarro and Mäkinen 2007],
so we only summarize the operations they support. First, given a pattern P, they
find the interval A[sp, ep] of the suffixes that start with P, in time t_search(m). Second,
given a cell i, they return A[i], in time t_SA. Third, they are generally able to
emulate the inverse permutation of the suffix array, A^{-1}[j], also in time t_SA. This
corresponds to asking which cell of A points to the suffix T[j, n]. Finally, many
CSAs are self-indexes, meaning that they are able to extract any substring T[i, j]
without accessing T, so the text itself can be discarded. Those CSAs replace T
by a (usually) compressed version that can in addition be queried. The following
definition captures the minimum functionality we need from a CSA in this article.

Definition 6 (Compressed Suffix Array, CSA) A compressed suffix array (CSA)
is a data structure that simulates a suffix array on text T[1, n] over alphabet [1, σ]
using at most O(n lg σ) bits. It finds the interval A[sp, ep] of a pattern P[1, m] in
time t_search(m), and computes any A[i] or A^{-1}[j] in time t_SA.

Table 1 lists the performance of some of the best CSAs to date (which include
several ones not included in the previous survey), to give an idea of the performances
they offer. All of them are self-indexes.

Fig. 3. The C and L arrays for our running example.



4. DOCUMENT LISTING IN LINEAR SPACE 



Muthukrishnan [2002] gave an optimal solution to the document listing problem
(Problem 1), within linear space, that is, O(n lg n) bits (see Janardan and Lopez
[1993] and Matias et al. [1998] for previous work). Muthukrishnan introduced the
so-called document array, which has been used many times since then.

Definition 7 (Document Array) Given a document collection 𝒟, its text T,
and the suffix array A[1, n] of T, the document array C[1, n] contains in each C[i]
the number of the document suffix A[i] belongs to.

It is not hard to see that all we need for document listing is to determine the
interval A[sp, ep] corresponding to the pattern and then output the set of distinct
values in C[sp, ep]. This gives rise to the following algorithmic problem.

Problem 7 (Color Listing) Preprocess an array C[1, n] of colors in [1, D] so
that, given a range [sp, ep], we can output the different colors in C[sp, ep].



To solve this problem, Muthukrishnan defines a second array, which is also fundamental
for many related problems.

Definition 8 (Predecessors Array) Given an array C[1, n], the predecessors array
of C is L[1, n] such that L[i] = max({j : 1 ≤ j < i, C[j] = C[i]} ∪ {0}).

That is, array L links each position in C to the previous occurrence of the same
color, or to position 0 if this is the first occurrence of that color in C.
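As a small illustration of Definitions 7 and 8, the following sketch derives C and L for the running example. The suffix array A is written out by hand here (an assumption of the sketch, chosen to agree with the values the text uses, such as A[10] = 6), and the function names are hypothetical.

```python
# Running example: every word is one symbol and "$" ends each document.
T = "mi ma ma $ la ma la $ me mi ma $ la me me $".split()

# Suffix array of T, 1-based (A[0] is padding). Written out by hand: this
# literal is an assumption of the sketch.
A = [0, 4, 8, 12, 16, 7, 5, 13, 3, 11, 6, 2, 15, 14, 9, 10, 1]


def document_array(T, A):
    """C[i] = number of the document that contains text position A[i]."""
    doc_of = [0] * (len(T) + 1)
    d = 1
    for p in range(1, len(T) + 1):
        doc_of[p] = d
        if T[p - 1] == "$":         # the terminator still belongs to document d
            d += 1
    return [0] + [doc_of[A[i]] for i in range(1, len(T) + 1)]


def predecessors_array(C):
    """L[i] = previous position holding color C[i], or 0 if there is none."""
    L = [0] * len(C)
    last = {}                       # last position seen for each color
    for i in range(1, len(C)):
        L[i] = last.get(C[i], 0)
        last[C[i]] = i
    return L


C = document_array(T, A)
L = predecessors_array(C)
print(C[1:])
print(L[1:])
```

Following L from the last occurrence of a color visits all of its occurrences right to left, which is the linked-list view mentioned for color 1.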

Example. Fig. 3 illustrates arrays C and L on our running example. We show
how L acts as a linked list of the occurrences of color 1.

Color listing is then based on the following lemma, which is immediate to see.

Lemma 1 [Muthukrishnan 2002] If a color d occurs in C[sp, ep], then its leftmost
occurrence p ∈ [sp, ep] is the only one where it holds L[p] < sp.

From the lemma, it follows that all we have to do for color listing is to find all
the values smaller than sp in L[sp, ep]. To do this in optimal time, Muthukrishnan
makes use of RMQs (recall the RMQ solutions introduced earlier). The algorithm
proceeds recursively. It starts with the interval [i, j] = [sp, ep]. It first finds p = RMQ_L(i, j). If
L[p] < sp, then C[p] is a new distinct color in C[sp, ep] and can be reported immediately.
Then we continue recursively with the intervals [i, p - 1] and [p + 1, j].
If, instead, L[p] ≥ sp, then position p is not the first occurrence of color C[p] in
C[sp, ep], and moreover no position in C[i, j] is the first of its color. Thus we terminate
the recursion for the current interval [i, j]. Note that we always compare L[p]
with the original sp limit, even inside a recursive call with a smaller [i, j] interval.
The recursive calls define a binary tree: at each internal node (where L[p] < sp)
one distinct color appearing in C[sp, ep] is reported, and two further calls are made.
Leaves of the recursion tree (where L[p] ≥ sp) report no colors. Hence the recursion
tree has twice as many nodes as colors reported, and thus the algorithm runs in optimal
time. Indeed, it is interesting to realize that what this algorithm is doing is to
incrementally build the top part of the Cartesian tree of L[sp, ep] (recall the definition
of Cartesian trees).

Example. For a relevant example, consider color listing over C[sp, ep] = C[11, 16]
in the array of Fig. 4 (this corresponds to document listing of a lexicographic pattern
range ["ma ma", "mi ma"], which is perfectly possible on suffix arrays).

We start with p = RMQ_L(i, j) = RMQ_L(sp, ep) = RMQ_L(11, 16) = 12. Since
L[p] = L[12] = 7 < 11 = sp, we report color C[p] = C[12] = 4. Now we continue on
the left subinterval, L[i, p - 1] = L[11, 11]. Here obviously we have p = 11, and since
L[11] = 8 < 11 we report C[11] = 1. Now we go to the right of the initial recursive
call, for L[13, 16]. We compute p = RMQ_L(13, 16) = 14. Since L[14] = 9 < 11, we
report C[14] = 3, and recurse on both sides of p. The left side is L[13, 13]. Since
L[13] = 12 ≥ 11, we do not report this position and terminate the recursion. The
right side is L[15, 16]. Once again, we compute p = RMQ_L(15, 16) = 16, and since
L[16] = 11 ≥ 11, we also terminate the recursion here. Note we have not needed to
examine L[15] to know it does not contain new colors. We have correctly reported
the colors 4, 1, and 3.

In the bottom part of the figure we show the part of the Cartesian tree of L[11, 16]
we have uncovered, or what is the same, the tree of the recursive calls. Shaded nodes
represent colors reported (also marked in C), empty nodes represent cells where the
recursion ended, and dotted nodes are the part we have not visited of the Cartesian
tree (usually much more than just one node).

The algorithm is not only optimal time but also real time: As it reports the color
before making the recursive calls, the top part of the Cartesian tree can be thought
of as generated in preorder. Thus we list the first t distinct colors in time O(t).
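The recursion can be sketched as a generator, which also makes the real-time behavior visible because each color is yielded before the recursive calls. This is a minimal sketch under stated assumptions: the arrays C and L of the running example are written out by hand, `rmq` is a naive linear scan standing in for a constant-time RMQ structure, and the function names are hypothetical.

```python
# Arrays of the running example, 1-based (index 0 is padding). The literals
# are an assumption of this sketch.
C = [0, 1, 2, 3, 4, 2, 2, 4, 1, 3, 2, 1, 4, 4, 3, 3, 1]
L = [0, 0, 0, 0, 0, 2, 5, 4, 1, 3, 6, 8, 7, 12, 9, 14, 11]


def rmq(L, i, j):
    """Position of the leftmost minimum of L[i..j].

    Naive scan; a real solution would answer this in constant time.
    """
    return min(range(i, j + 1), key=lambda k: L[k])


def list_colors(C, L, sp, ep):
    """Yield the distinct colors of C[sp..ep], each at its leftmost occurrence."""
    def visit(i, j):
        if i > j:
            return
        p = rmq(L, i, j)
        if L[p] >= sp:              # no first occurrence inside [i, j]: prune
            return
        yield C[p]                  # reported before recursing (preorder)
        yield from visit(i, p - 1)
        yield from visit(p + 1, j)

    yield from visit(sp, ep)        # L[p] is always compared with the original sp


print(list(list_colors(C, L, 11, 16)))
```

On the interval [11, 16] the generator reports the colors in the order 4, 1, 3, the same order as in the example above.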



Solution 8 (Color Listing) [Muthukrishnan 2002] The problem can be solved in
linear space and real time.

In addition to this machinery, Muthukrishnan uses a suffix tree on T to compute
the interval [sp, ep]. This immediately gives an optimal solution to the document
listing problem.

Solution 9 (Document Listing) [Muthukrishnan 2002] The problem can be solved
in O(m + docc) time and O(n lg n) bits of space.

Fig. 4. Color listing on C[11, 16] in our running example.



Muthukrishnan [2002] considered other more complex variants of the problem,
such as listing the documents that contain t or more occurrences of the pattern, or
that contain two occurrences of the pattern within distance t. Those also lead to
interesting, albeit more complex, algorithmic problems.

5. DOCUMENT LISTING IN COMPRESSED SPACE 



Sadakane [2007] addressed the problem of reducing the space of Muthukrishnan's
solution. He replaced the suffix tree by a CSA (Definition 6), and proposed the first
RMQ solution that did not need to access L (this one used 4n + o(n) bits; the one
we have referenced in Solution 7 uses the optimal 2n + o(n)). Thus array L was not
necessary for computing RMQs on it. Muthukrishnan's algorithm, however, needs
also to ask if L[p] < sp in order to determine whether this is the first occurrence of
color C[p] or not. Sadakane replaces it by consulting a bitmap V[1, D] (set initially
to all 0s), so that if V[C[p]] = 0 then the document has not yet been reported, so
we report it and set V[C[p]] ← 1.

Just as before, Sadakane ends the recursion at an interval [i, j] when its minimum
position p satisfies V[C[p]] = 1. There is a delicate point about the correctness of
this algorithm, which is not stressed in that article. Replacing the check L[p] < sp
by V[C[p]] = 0 only works if we first process recursively the left interval, [i, p - 1],
and then the right interval, [p + 1, j]. In this case one can see that the leftmost
occurrence of each color is found and the algorithm visits the same cells as Muthukrishnan's
(we prove this formally in the Appendix). Otherwise, an error can occur.

Example. In the same Fig. 4 consider color listing in the interval C[5, 11], where
the four colors appear. Run Sadakane's algorithm going left first and verify that it
behaves exactly as Muthukrishnan's algorithm. Now consider processing the right
interval first. We start with p = RMQ_L(5, 11) = 8, report C[8] = 1 and set V[1] ← 1.
Now we go right and process C[9, 11]. Here we find p = RMQ_L(9, 11) = 9, report
C[9] = 3 and set V[3] ← 1. Then we process C[10, 11], find p = RMQ_L(10, 11) = 10,
and since C[10] = 2 and V[2] = 0, we report color 2 and set V[2] ← 1 (note that
Muthukrishnan's algorithm would not have reported color 2 here because L[p] ≥ sp).
Finally it processes C[11, 11], where color C[11] = 1 is not reported because V[1] is
already 1. Now we finally go to the left child of the initial recursive call, interval
C[5, 7]. Here we compute p = RMQ_L(5, 7) = 5, and since C[5] = 2 and V[2] = 1,
the recursion terminates without having reported color C[7] = 4.

Fig. 5. The structures for document listing in compressed space, on our running example. The
grayed structures are not stored.
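The order sensitivity is easy to reproduce. The following sketch implements the bitmap-based variant with a flag that selects which half is processed first; it is an illustration under stated assumptions, not Sadakane's actual structure: C and L are the hand-written arrays of the running example, and the naive `rmq` reads L directly, whereas the compressed solution answers RMQs without storing L.

```python
# Arrays of the running example, 1-based (index 0 is padding). The literals
# are an assumption of this sketch.
C = [0, 1, 2, 3, 4, 2, 2, 4, 1, 3, 2, 1, 4, 4, 3, 3, 1]
L = [0, 0, 0, 0, 0, 2, 5, 4, 1, 3, 6, 8, 7, 12, 9, 14, 11]


def rmq(L, i, j):
    """Leftmost minimum of L[i..j] (naive stand-in for a succinct RMQ)."""
    return min(range(i, j + 1), key=lambda k: L[k])


def list_colors_bitmap(C, L, sp, ep, D, left_first=True):
    """Color listing that tests a bitmap V instead of comparing L[p] with sp."""
    V = [0] * (D + 1)               # V[d] = 1 once color d has been reported
    out = []

    def visit(i, j):
        if i > j:
            return
        p = rmq(L, i, j)
        if V[C[p]]:                 # color already reported: stop this branch
            return
        V[C[p]] = 1
        out.append(C[p])
        halves = [(i, p - 1), (p + 1, j)]
        if not left_first:
            halves.reverse()        # the incorrect processing order
        for a, b in halves:
            visit(a, b)

    visit(sp, ep)
    return out


print(list_colors_bitmap(C, L, 5, 11, D=4, left_first=True))
print(list_colors_bitmap(C, L, 5, 11, D=4, left_first=False))
```

Going left first reports all four colors of C[5, 11]; going right first reports only 1, 3, and 2, reproducing the missed color 4 of the example.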



Sadakane's technique yields a solution that uses little space on top of the original
array, as opposed to Muthukrishnan's, which uses O(n lg n) extra bits for L.

Solution 10 (Color Listing) [Sadakane 2007] The problem can be solved using
O(n) bits of space on top of array C, and in real time.

The reader may have noticed that we should reinitialize V to all zeros before proceeding
to the next query. A simple solution is to remember the documents output
by the algorithm, so as to reset those entries of V after finishing. This requires
docc lg D ≤ D lg D bits, which may be acceptable. Otherwise, array V could be
restored by rerunning the algorithm and using the bits with the reverse meaning.
In practice, however, this doubles the running time. A more practical alternative
is to use a classical solution to initialize arrays in constant time [Mehlhorn 1984].
Although this solution requires O(D lg D) extra bits, we show in Appendix C how
to reduce the space to O(D) = O(n) bits, and even D + o(D) bits.

Note that we have removed array L, but C is still used to report the actual
colors. For the specific case of document listing, Sadakane also replaced array C
by noticing that it can be easily computed from the CSA and a bitmap B[1, n]
that marks with 1s the positions of the "$" symbols in T. Then, using the rank
operation (Problem 4) we compute C[i] = 1 + rank_1(B, A[i] - 1). While rank can
be computed in constant time (Solution 1), the computation of A[i] using the CSA
(Definition 6) requires time t_SA. Overall, the following result is obtained.

Solution 11 (Document Listing) [Sadakane 2007] The problem can be solved in
time O(t_search(m) + docc t_SA) and |CSA| + O(n) bits of space, where CSA is a CSA
indexing 𝒟.

Example. Fig. 5 illustrates the components of Sadakane's solution.



Hon et al. [2009] were even more ambitious and attempted to reduce the O(n)-bit
term in the space complexity to just o(n). They group b consecutive entries
in L. Then they create a sampled array L'[1, n/b] where each entry contains the
minimum value in the corresponding block of L. The RMQ data structure is built
over L', and they run the algorithm over the blocks that are fully contained in
L[sp, ep]. Each time a position in L' is reported, they consider all the b entries
in the corresponding block of L, reporting all the documents that have not yet
been reported. Only if all of them have already been reported can the recursion
stop. They also have to process by brute force the tails of the interval L[sp, ep]
that overlap blocks. Therefore they have a multiplicative time overhead of O(b)
per document reported, in exchange for reducing the O(n)-bit space to O(n/b).
This idea, unfortunately, does not work, for the same delicate reason we have
described on Sadakane's method (where it still worked). Marking as visited other
documents than the one holding the minimum L value (namely all others in the
block) can make the recursion stop earlier than it should.

Example. Consider for example the array of colors C = (2, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1),
with corresponding predecessor array L = (0, 0, 2, 3, 1, 5, 6, 7, 8, 0, 10, 11), and a
grouping factor b = 2. Therefore we have L' = (0, 2, 1, 6, 0, 10). If the query is
for the interval [3, 12], it is mapped to [2, 6] in L'. The minimum is in L'[5] = 0,
which makes the algorithm mark colors C[9] = 2 and C[10] = 1, setting V[1] ← 1
and V[2] ← 1. Now we go left and consider subinterval L'[2, 4], with the minimum
in L'[3] = 1. The corresponding colors are C[5] = 2 and C[6] = 2. But since
V[2] = 1, both cells are already reported and the algorithm finishes in this branch
of the recursion, missing color C[3] = C[4] = 3.

Therefore, the document listing problem using space |CSA| + o(n) is open, even if
D = o(n). This is an interesting goal because it uses asymptotically optimal space.
We are going to give a satisfactory solution to this problem at the end of Section 10
(Solution 35).



6. COMPUTING TERM FREQUENCIES 

As explained in the Introduction, the term frequency tf (P, d) is a key component 
in many relevance formulas, and thus the problem of computing it for the docu- 
ments that are output by a document listing algorithm is relevant. In terms of the 
document array, this leads to the following problem on colors. 

Problem 8 (Color Listing with Frequencies) Preprocess an array C[1, n] of
colors in [1, D] so that, given a range [sp, ep], we can output the distinct colors in
C[sp, ep], each with its frequency in this range.

Given a color d, computing its frequency in C[sp, ep] is easily done via rank operations
(Problem 4): rank_d(C, ep) - rank_d(C, sp - 1). Therefore, any solution to color
listing (e.g., Solution 10) plus any solution to computing rank on sequences (e.g.,
Solution 4) yields a solution to color listing with frequencies. Indeed, Solution 4 is
close to optimal [Belazzougui and Navarro 2012]. Thus, one can solve color listing
with frequencies using o(n lg D) additional bits and O(lg lg D) time per color output.
Somewhat surprisingly, as we will see, the problem can be solved faster.
The corresponding problem on documents is thus defined as follows.

Problem 9 (Document Listing with Frequencies) Preprocess a document collection
𝒟 so that, given a pattern string P, one can compute {(d, tf(P, d)), tf(P, d) > 0}.



Välimäki and Mäkinen [2007] proposed to reduce document listing with frequencies
to color listing with frequencies on the document array C, using the basic
technique we have described. More interestingly, they showed that the whole document
listing problem can be reduced to rank and select operations on C. Array
L can be simulated as L[i] = select_{C[i]}(C, rank_{C[i]}(C, i) - 1), and we can use the
original document listing algorithm of Muthukrishnan [2002] over this simulation,
yet using the O(n)-bit RMQ of Solution 7. Although Välimäki and Mäkinen used
wavelet trees to represent C, using instead Solution 4 yields the following result.
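The simulation of L, and the rank-based frequency formula rank_d(C, ep) - rank_d(C, sp - 1), can be checked with a few lines of code. This sketch uses naive linear-time `rank` and `select` in place of a compact sequence representation; the array literals are those of the running example written out by hand, and the convention that selecting the 0th occurrence returns 0 is an assumption of the sketch.

```python
# Arrays of the running example, 1-based (index 0 is padding). The literals
# are an assumption of this sketch; L is used only to check the simulation.
C = [0, 1, 2, 3, 4, 2, 2, 4, 1, 3, 2, 1, 4, 4, 3, 3, 1]
L = [0, 0, 0, 0, 0, 2, 5, 4, 1, 3, 6, 8, 7, 12, 9, 14, 11]


def rank(C, d, i):
    """Number of occurrences of color d in C[1..i] (naive scan)."""
    return sum(1 for k in range(1, i + 1) if C[k] == d)


def select(C, d, r):
    """Position of the r-th occurrence of d in C; 0 when r == 0 (assumed)."""
    if r == 0:
        return 0
    seen = 0
    for k in range(1, len(C)):
        if C[k] == d:
            seen += 1
            if seen == r:
                return k
    raise ValueError("color d occurs fewer than r times")


def simulated_L(C, i):
    """L[i] obtained from rank and select on C, without storing L."""
    d = C[i]
    return select(C, d, rank(C, d, i) - 1)


def frequency(C, d, sp, ep):
    """Number of occurrences of color d in C[sp..ep]."""
    return rank(C, d, ep) - rank(C, d, sp - 1)


print([simulated_L(C, i) for i in range(1, len(C))])
print(frequency(C, 3, 11, 16))
```

With a representation that answers rank and select in O(lg lg D) time, each simulated access to L costs that much, which is where the lg lg D factor in the result below comes from.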



Solution 12 (Document Listing with Frequencies) [Välimäki and Mäkinen
2007] The problem can be solved in time O(t_search(m) + docc lg lg D) and |CSA| +
n lg D + o(n lg D) bits of space, where CSA is a CSA indexing 𝒟.

While computing term frequencies within that time is attractive, the extra space
is much higher than the near-optimal one of Solution 11. The document array is
usually much larger than the text or its CSA. Sadakane [2007] proposed instead a
solution that doubles the CSA space, and can compute term frequencies. Unlike
the solution above, Sadakane's does not work on general arrays of colors and cannot
compute the term frequency of an arbitrary document, but only of those output by
a document listing algorithm.

The idea is that, in addition to the global CSA, we will maintain a CSA for the
local suffix array A_d of each document T_d. Since the pointers of A_d are interspersed
in A in the same order, we can do the following to map a global entry A[i] to the
corresponding local entry A_d[j] in its document d. First, compute the global text
position p = A[i]. Then find the corresponding document d = 1 + rank_1(B, p - 1)
and its starting position in T, s = 1 + select_1(B, d - 1). Finally, use the inverse suffix
array to map the local offset p - s + 1 into the local suffix array of d, j = A_d^{-1}[p - s + 1].
This can be done in time t_SA using CSAs for A and for all the A_d.
Example. Let us map A[10] to its local suffix array. Fig. 6 illustrates the necessary
components. We compute p = A[10] = 6 using the global CSA. Now we compute
the corresponding document, d = 1 + rank_1(B, 6 - 1) = 2, its starting position
s = 1 + select_1(B, 2 - 1) = 5, and the local offset p - s + 1 = 2. Now we obtain
A_2^{-1}[2] = 4 with the local CSA of document 2. Certainly, A_2[4] = 2 points to the
local suffix "ma la $", which is the same global suffix A[10] = 6.
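The mapping from a global cell to a local one can be traced with plain arrays. In this sketch the global suffix array A is written out by hand for the running example (an assumption), `rank1` and `select1` are naive scans over the bitmap B, and the local inverse suffix array is rebuilt by sorting on every call; a real solution would query a CSA of each document instead. All function names are hypothetical.

```python
# Running example: every word is one symbol and "$" ends each document.
T = "mi ma ma $ la ma la $ me mi ma $ la me me $".split()

# Global suffix array, 1-based (A[0] is padding); an assumed literal.
A = [0, 4, 8, 12, 16, 7, 5, 13, 3, 11, 6, 2, 15, 14, 9, 10, 1]

# B[p] = 1 iff T[p] is a terminator (1-based, B[0] is padding).
B = [0] + [1 if sym == "$" else 0 for sym in T]


def rank1(B, p):
    """Number of 1s in B[1..p]."""
    return sum(B[1:p + 1])


def select1(B, r):
    """Position of the r-th 1 in B; 0 when r == 0 (assumed convention)."""
    if r == 0:
        return 0
    seen = 0
    for p in range(1, len(B)):
        seen += B[p]
        if seen == r:
            return p
    raise ValueError("B has fewer than r ones")


def local_inverse_sa(doc):
    """inv[q] = rank of the suffix doc[q..] among the suffixes of doc (1-based)."""
    order = sorted(range(1, len(doc) + 1), key=lambda q: doc[q - 1:])
    inv = [0] * (len(doc) + 1)
    for j, q in enumerate(order, start=1):
        inv[q] = j
    return inv


def to_local(i):
    """Map global cell i to (d, j): document d and cell j of its local suffix array."""
    p = A[i]                            # global text position
    d = 1 + rank1(B, p - 1)             # document containing p
    s = 1 + select1(B, d - 1)           # where document d starts in T
    doc = T[s - 1:select1(B, d)]        # symbols of T_d, terminator included
    return d, local_inverse_sa(doc)[p - s + 1]


print(to_local(10))
```

Mapping the first and last cells of a document within A[sp, ep] in this way gives the two local positions whose difference yields the term frequency, as described next.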

Now imagine we know the first and last occurrence positions of a document d
in C[sp, ep], say i and i'. We use the above procedure to map them to positions
j and j' in A_d. Those are clearly the first and last suffixes of A_d that start with
P, and therefore tf(P, d) = j' - j + 1. Muthukrishnan's algorithm naturally finds
the leftmost occurrence i of each document d in C[sp, ep]. We can create another
RMQ (now meaning range maximum query) structure over a variant of L where
each element points to its successor rather than its predecessor. Run over this
new RMQ, Muthukrishnan's algorithm will find the rightmost occurrence i' of each
document d in C[sp, ep]. A final sorting of the documents is necessary to match
the first and last occurrence of each document d, and then each frequency can be
computed as explained. Since for all reasonable CSAs it holds Σ_d |CSA_d| ≤ |CSA|,
we have the following result, where we use y-fast tries [Willard 1983] for sorting.

Fig. 6. The structures for mapping global to local CSA positions.

Solution 13 (Document Listing with Frequencies) [Sadakane 2007] The problem
can be solved in time O(t_search(m) + docc (t_SA + lg lg D)) and 2|CSA| + O(n) bits
of space, where CSA is a CSA indexing 𝒟.



Belazzougui et al. [2013] built on this idea to replace the document array by a
weaker representation using less space. They use one mmphf B_d (recall the mmphf
solutions introduced earlier) for each document d, marking the positions B_d[i] = 1
where C[i] = d. Then, if one knows that i and i' are the first and last occurrences
of document d in C[sp, ep], it holds that tf(P, d) = rank_1(B_d, i') - rank_1(B_d, i) + 1.
By combining the variants of those solutions they obtain various results, the most
interesting of which follows.



Solution 14 (Document Listing with Frequencies) [Belazzougui et al. 2013]
The problem can be solved in time O(t_search(m) + docc (t_SA + lg lg D)) and |CSA| +
O(n lg lg lg D) bits of space, where CSA is a CSA indexing 𝒟.

By using the faster mmphf variant, the same technique solves the general color
listing with frequencies problem, in optimal time. For this sake, they show how to
avoid the final sorting step by using O(D lg n) further bits. This space can also be
added to Solutions 13 and 14 in order to remove the O(docc lg lg D) term in the
time complexities.

Solution 15 (Color Listing with Frequencies) [Belazzougui et al. 2013] The
problem can be solved in optimal time and O(n lg lg D + D lg n) bits of space on top
of array C.

Note that, compared to Solution 10 that listed the colors without frequencies,
the real time has become "just" optimal, and the extra space of O(n) bits has
increased, yet it is still o(n lg D) as long as D = o(n). This yields a linear-space
and optimal-time solution to the document retrieval problem.

Solution 16 (Document Listing with Frequencies) [Belazzougui et al. 2013]
The problem can be solved in O(m + docc) time and O(n lg n) bits of space.

Fig. 7. The wavelet tree of array C in our running example. Grayed data is not represented.



All the solutions up to now have built over the original algorithm of Muthukrishnan
[2002]. It is time to introduce the wavelet tree data structure [Grossi et al.
2003], which is used in the first solution that uses a completely different technique.

A wavelet tree over a sequence C[1, n] is a perfectly balanced binary tree where
each node handles a range of the alphabet. The root handles the whole alphabet
and the leaves handle individual symbols. At each node, the alphabet is divided by
half and the left child handles the smaller half of the symbols and the right child
handles the larger half. Each node v represents (but does not store) a subsequence
C_v of C containing the symbols of C that the node handles. What each internal
node v stores is just a bitmap B_v, where B_v[i] = 0 iff C_v[i] belongs to the range of
symbols handled by the left child of v, otherwise B_v[i] = 1.

Over an alphabet [1, D], the wavelet tree has height ⌈lg D⌉ and stores n bits per
level, thus its total space is n⌈lg D⌉ bits, that is, the same as a plain representation
of C. For it to be functional, we need that the bitmaps B_v can answer rank and
select queries, thus the total space becomes n lg D + o(n lg D) bits. Within this
space, the wavelet tree actually represents C: to recover C[i], we start at the root
node v. If B_v[i] = 0 then C[i] belongs to the first half of the alphabet, so we continue
the search on the left child of the root, with the new position i ← rank_0(B_v, i). Else
we continue on the right child, with i ← rank_1(B_v, i). When we arrive at a leaf
handling symbol d, it holds C[i] = d. The process takes O(lg D) time. Within this
time the wavelet tree can also answer rank and select queries on C, as well as many
other operations. See Makris [2012] and Navarro [2012] for full surveys.



Example. Fig. 7 displays the wavelet tree for our array C. Since we have D = 4
symbols, the wavelet tree has 2 bitmap levels (plus the leaves, which are virtual and
not stored). Each node shows its C_v sequence (which is not stored) in gray and its
B_v bitmap (which is stored) in black. Each level stores n bits. To recover C[9] we
start at the root node v and read B_v[9] = 1. Thus we go to the right child u of v,
where the original position 9 becomes rank_1(B_v, 9) = 4 in B_u. Now B_u[4] = 0, so
we go to the left child of u and reach the leaf of symbol 3, thus C[9] = 3.
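The access procedure can be sketched with a small pointer-based wavelet tree. This is an illustration only: the class name `Node` and the function `access` are hypothetical, the bitmaps are plain Python lists with a linear-time rank (a real structure answers rank in constant time), and the document array of the running example is written out by hand.

```python
class Node:
    """Wavelet tree node handling the symbols lo..hi of the alphabet."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.bits = None
        self.left = self.right = None
        if lo == hi:                    # leaf: a single symbol, nothing stored
            return
        mid = (lo + hi) // 2            # left child handles lo..mid
        self.bits = [0 if c <= mid else 1 for c in seq]
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)

    def rank(self, b, i):
        """Number of bits equal to b among the first i bits (naive count)."""
        return sum(1 for x in self.bits[:i] if x == b)


def access(root, i):
    """Recover C[i] (1-based) by descending from the root to a leaf."""
    v = root
    while v.lo != v.hi:
        b = v.bits[i - 1]
        i = v.rank(b, i)                # position of the symbol in the child
        v = v.left if b == 0 else v.right
    return v.lo


# Document array of the running example as a 0-based Python list (assumed
# literal); access() itself uses the 1-based positions of the text.
C = [1, 2, 3, 4, 2, 2, 4, 1, 3, 2, 1, 4, 4, 3, 3, 1]
root = Node(C, 1, 4)
print(access(root, 9))
```

The subsequences are only used while building; afterwards each internal node keeps just its bitmap, as in the description above.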

A formal definition of wavelet trees follows. 



Definition 9 (Wavelet Tree) A wavelet tree over a sequence S[1, n] on alphabet
[1, D] is a perfectly balanced binary tree where the i-th node of height h (being the
leaves of height 0) is associated to the symbols d such that ⌈d/2^h⌉ = i. The node v
handling symbols [a, b] ⊆ [1, D] represents the subsequence S_v of S consisting of the
symbols in [a, b]. For each node v we store a bitmap B_v[1, |S_v|] where B_v[i] = 0 iff
the left child of v is associated with symbol S_v[i].



Gagie et al. [2009] showed that the wavelet tree of C allows for a completely
different document listing algorithm, based on the following basic problem (also
known as range selection queries).

Problem 10 (Range Quantile) Preprocess an array C[1, n] of integers so that,
given a range [sp, ep] and an index q, we can return the q-th smallest element in
C[sp, ep].

Gagie et al. solved this problem over a wavelet tree representation of C as follows.
We start at the root v with interval [sp, ep]. We count the number of 0s in B_v[sp, ep],
with z = rank_0(B_v, ep) - rank_0(B_v, sp - 1). Now, if q ≤ z, then the answer belongs
to the integers handled by the left child of the root, thus we remap the interval to
sp' = rank_0(B_v, sp - 1) + 1 and ep' = rank_0(B_v, ep) and continue recursively on the
left child of the root. Otherwise, we remap the interval to sp' = rank_1(B_v, sp - 1) + 1
and ep' = rank_1(B_v, ep) and continue recursively on the right child of the root, this
time looking for q - z instead of q. When we arrive at a leaf, this is the q-th smallest
value in C[sp, ep]. The process takes O(lg D) time. Observe that the final ep - sp + 1
value is the frequency of the q-th element in the array.

Example. We obtain the median of C[5, 11], that is, q = 4. At the root node v we
count the number of 0s in B_v[5, 11] using rank_0(B_v, 11) - rank_0(B_v, 5 - 1) = 5. Since
5 ≥ 4 = q, the q-th element of C[5, 11] is on the left child u of v. Thus we descend to
u, mapping the interval [5, 11] to [rank_0(B_v, 5 - 1) + 1, rank_0(B_v, 11)] = [3, 7]. Now
we count the number of 0s in B_u[3, 7] using rank_0(B_u, 7) - rank_0(B_u, 3 - 1) = 2.
Since 2 < 4 = q, the q-th element of C_u[3, 7] is on the right child w of u. Now we
remap the interval [3, 7] to [rank_1(B_u, 3 - 1) + 1, rank_1(B_u, 7)] = [2, 4], and since
we move to the right, we set q ← q - 2 = 2. Now we arrive at w and, since it
corresponds to the leaf of symbol 2, we conclude that the median of C[5, 11] is 2.
Moreover, it occurs 4 - 2 + 1 = 3 times in C[5, 11].
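The descent used for range quantile queries can be sketched as follows. The sketch is self-contained and makes the same simplifying assumptions as a teaching implementation would: a pointer-based tree with hypothetical names (`Node`, `quantile`), bitmaps stored as Python lists with linear-time rank, and the document array of the running example written out by hand.

```python
class Node:
    """Wavelet tree node handling the symbols lo..hi of the alphabet."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.bits = None
        self.left = self.right = None
        if lo == hi:                    # leaf: nothing is stored
            return
        mid = (lo + hi) // 2
        self.bits = [0 if c <= mid else 1 for c in seq]
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)

    def rank(self, b, i):
        """Number of bits equal to b among the first i bits (naive count)."""
        return sum(1 for x in self.bits[:i] if x == b)


def quantile(root, sp, ep, q):
    """Return (value, frequency) of the q-th smallest element of C[sp..ep].

    Positions are 1-based and 1 <= q <= ep - sp + 1 is assumed.
    """
    v = root
    while v.lo != v.hi:
        z = v.rank(0, ep) - v.rank(0, sp - 1)       # 0s inside the interval
        if q <= z:                                  # answer is in the left half
            sp, ep = v.rank(0, sp - 1) + 1, v.rank(0, ep)
            v = v.left
        else:                                       # answer is in the right half
            sp, ep = v.rank(1, sp - 1) + 1, v.rank(1, ep)
            q -= z
            v = v.right
    return v.lo, ep - sp + 1


# Document array of the running example as a 0-based Python list (assumed literal).
C = [1, 2, 3, 4, 2, 2, 4, 1, 3, 2, 1, 4, 4, 3, 3, 1]
root = Node(C, 1, 4)
print(quantile(root, 5, 11, 4))
```

For the median of C[5, 11] the function returns the value 2 with frequency 3, as in the example above.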

It is possible to solve the range quantile problem in time O(lg n / lg lg n) using
O(n lg n) bits (assuming D = O(n)) [Brodal et al. 2011], and this is optimal within
O(n polylog(n)) space [Jørgensen and Larsen 2011]. In this survey we are more
interested in solutions using sublinear extra space, thus we state that result. It is
an open problem to obtain optimal time within sublinear extra space.

Solution 17 (Range Quantile) [Gagie et al. 2009] The problem can be solved in
O(lg D) time on a representation of C that uses n lg D + o(n lg D) bits of space.



Gagie et al. use range quantile queries to solve document listing as follows. They
first ask for the $q = 1$st quantile of $C[sp, ep]$, thus obtaining the smallest document
in the range, $d$, and its frequency, $tf(P, d)$. Then they report $d$ and set $q \leftarrow
q + tf(P, d)$. Now they ask for the $q$th element in $C[sp, ep]$, which gives the second
smallest document in the range with its frequency, and so on, until all the documents
are reported. Note that they do not resort to Muthukrishnan's technique, nor do they need
RMQs. Moreover, they return the documents in increasing order.
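The quantile descent and its use for listing can be sketched as follows. This is only an illustration under simplifying assumptions: the wavelet tree is pointer-based, each bitmap is replaced by a plain array of prefix counts of 0s (so it is not succinct), and the names `WaveletTree`, `quantile` and `list_documents` are hypothetical. Positions are 1-based and ranges inclusive, as in the text.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi] (illustrative, not succinct)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        # zeros[i] plays the role of rank_0(B_v, i); rank_1(B_v, i) is i - zeros[i].
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, sp, ep, q):
        """Return (value, frequency) of the q-th smallest element of C[sp, ep]."""
        assert 1 <= q <= ep - sp + 1
        node = self
        while node.lo != node.hi:
            z = node.zeros[ep] - node.zeros[sp - 1]      # number of 0s in B_v[sp, ep]
            if q <= z:                                   # answer is in the left child
                sp, ep = node.zeros[sp - 1] + 1, node.zeros[ep]
                node = node.left
            else:                                        # answer is in the right child
                sp, ep = sp - node.zeros[sp - 1], ep - node.zeros[ep]
                q -= z
                node = node.right
        return node.lo, ep - sp + 1


def list_documents(wt, sp, ep):
    """Yield (d, tf) for each distinct value in C[sp, ep], in increasing order of d."""
    q = 1
    while q <= ep - sp + 1:
        d, tf = wt.quantile(sp, ep, q)
        yield d, tf
        q += tf                                          # jump to the next distinct value


C = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]                    # made-up array, not the running example
wt = WaveletTree(C)
print(wt.quantile(3, 9, 4))                              # median of C[3, 9]: (5, 2)
print(list(list_documents(wt, 3, 9)))
```

Each call to `quantile` walks one root-to-leaf path, so listing this way costs one such path per reported document.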

While their complexity is not competitive with our previous solutions, a simplification
of their method improves it [Gagie et al. 2012]. Instead of quantile queries,
just traverse all the wavelet tree paths from the root, towards left and right children,
stopping either when the interval $[sp, ep]$ becomes empty, or when we arrive at a
leaf handling an integer $d$, where we report document $d$ with $tf(P, d) = ep - sp + 1$.

Example. Let us list the different values in $C[12, 15]$ using range quantile queries.
We ask for the $q = 1$st element and get that it is 3, which appears 2 times. Thus
we set $q \leftarrow q + 2 = 3$ and ask for the $q = 3$rd element, getting 4, which occurs 2
times. Thus we set $q \leftarrow q + 2 = 5$, which is larger than our interval, so we finish.
Let us now proceed by a recursive traversal. We map the interval $C[12, 15]$ to the
left child of the root, obtaining $[8, 7]$, which is empty, so there are no elements to
report in this subtree. On the right child of the root, the interval is mapped to $[5, 8]$.
Now we map to its left child, obtaining $[3, 4]$ at the leaf of symbol 3, so we report 3
with frequency $4 - 3 + 1 = 2$. Now we go to the right child, mapping $[5, 8]$ to $[3, 4]$.
Now we arrive at the leaf of symbol 4, which we also report with frequency 2.
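A minimal sketch of the recursive traversal follows, under the same simplifying assumptions as before: nodes are plain dictionaries, bitmaps are replaced by prefix counts of 0s, and the function names `build` and `list_with_freqs` are illustrative only. An empty mapped interval prunes the whole subtree, which is where the savings come from.

```python
def build(seq, lo, hi):
    """Build a wavelet tree node for values in [lo, hi]; 'zeros' holds prefix counts of 0 bits."""
    node = {"lo": lo, "hi": hi}
    if lo < hi and seq:
        mid = (lo + hi) // 2
        zeros = [0]
        for x in seq:
            zeros.append(zeros[-1] + (x <= mid))
        node["zeros"] = zeros
        node["left"] = build([x for x in seq if x <= mid], lo, mid)
        node["right"] = build([x for x in seq if x > mid], mid + 1, hi)
    return node


def list_with_freqs(node, sp, ep, out):
    """Append (d, tf) to out for every distinct value d in the node's range [sp, ep] (1-based)."""
    if sp > ep:                          # empty interval: nothing to report below this node
        return
    if node["lo"] == node["hi"]:         # leaf: the interval length is the frequency
        out.append((node["lo"], ep - sp + 1))
        return
    z = node["zeros"]
    list_with_freqs(node["left"], z[sp - 1] + 1, z[ep], out)      # map with rank_0
    list_with_freqs(node["right"], sp - z[sp - 1], ep - z[ep], out)  # map with rank_1


C = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]    # made-up array, not the running example
root = build(C, min(C), max(C))
found = []
list_with_freqs(root, 3, 9, found)
print(found)
```

Because the left child is always visited first, the values come out in increasing order, as with the quantile-based method.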

It is shown that the time per item output of the recursive traversal improves as 
more documents are listed. 



Solution 18 (Document Listing with Frequencies) [Gagie et al. 2012] The
problem can be solved in time $O(t_{search}(m) + docc \lg(D/docc))$ and $|CSA| + n \lg D +
o(n \lg D)$ bits of space, where CSA is a CSA indexing $\mathcal{D}$.

We note that, while document listing could be carried out within just $O(n)$ extra
bits on top of the CSA, reporting frequencies requires significantly more space. This
is similar to what we observed for color listing. In Section 10 (Solution 34) we will
show how to use the solutions for top-$k$ retrieval to perform document listing with
frequencies using only $o(n)$ bits on top of the CSA.

7. COMPUTING DOCUMENT FREQUENCIES 

Document frequency df(P), the number of distinct documents where a pattern 
occurs, is used in many variants of the tf-idf weighting formula, as mentioned in 
the Introduction. In other contexts, such as pattern mining, it is a measure of how 
interesting a pattern is. 

Problem 11 (Document Frequency) Preprocess a document collection $\mathcal{D}$ so
that, given a pattern $P$, one can compute $df(P)$, the number of documents where $P$
appears.

In terms of the document array, computing document frequency leads to the 
following problem. 

Problem 12 (Color Counting) Preprocess an array $C[1, n]$ of colors in $[1, D]$
so that, given a range $[sp, ep]$, we can compute the number of distinct colors in
$C[sp, ep]$.

Fig. 8. The grid representation of array L in our running example.



This problem is also called categorical range counting and it has been recently
shown to require at least $\Omega(\lg n / \lg\lg n)$ time if using space $O(n\,\mathrm{polylog}(n))$ [Larsen
and van Walderveen 2013]. Indeed, it is not very difficult to match this lower bound.
By Lemma 1 it suffices to count the number of values smaller than $sp$ in $L[sp, ep]$
[Gupta et al. 1995]. This is a well-known geometric problem, which in simplified
form follows.



Problem 13 (Two-Dimensional Range Counting) Preprocess an $n \times n$ grid
of $n$ points so that, given a range $[r_1, r_2] \times [c_1, c_2]$, we can count the number of
points in the range.

Our problem on $L$ becomes a two-dimensional range counting problem if we
consider the points $(L[i], i)$. Then our two-dimensional range is $[0, sp-1] \times [sp, ep]$.

Example. Fig. 8 shows the grid corresponding to array $L$ in our running example,
where we have highlighted the query corresponding to $C[11, 16]$. As expected,
two-dimensional range counting indicates that there are 3 points in $[0, 10] \times [11, 16]$, and
thus 3 distinct colors in $C[11, 16]$.
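The reduction can be tried out with the sketch below. It assumes, as the text implies, that $L[i]$ holds the position of the previous occurrence of color $C[i]$ (0 if there is none). The counting structure is a simple merge sort tree answering in $O(\lg^2 n)$ time, used here only as a stand-in for the faster structures cited in this section; all names are hypothetical.

```python
from bisect import bisect_left


def prev_occurrence(C):
    """L[i] = position of the previous occurrence of C[i], or 0 if none (1-based; L[0] unused)."""
    last, L = {}, [0]
    for i, c in enumerate(C, start=1):
        L.append(last.get(c, 0))
        last[c] = i
    return L


class MergeSortTree:
    """Counts the values smaller than x in L[sp, ep]; each segment tree node keeps a sorted list."""

    def __init__(self, L):
        n = len(L) - 1
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.tree = [[] for _ in range(2 * self.size)]
        for i in range(1, n + 1):
            self.tree[self.size + i - 1] = [L[i]]
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = sorted(self.tree[2 * v] + self.tree[2 * v + 1])

    def count_less(self, sp, ep, x):
        lo, hi = self.size + sp - 1, self.size + ep   # half-open range of leaves
        total = 0
        while lo < hi:
            if lo & 1:
                total += bisect_left(self.tree[lo], x)
                lo += 1
            if hi & 1:
                hi -= 1
                total += bisect_left(self.tree[hi], x)
            lo //= 2
            hi //= 2
        return total


def count_colors(mst, sp, ep):
    """Distinct colors in C[sp, ep] = points of (L[i], i) inside [0, sp-1] x [sp, ep]."""
    return mst.count_less(sp, ep, sp)


C = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]                 # made-up array, not the running example
mst = MergeSortTree(prev_occurrence(C))
print(count_colors(mst, 3, 9))
```

Note that the query never touches $C$ itself, only the structure built on $L$.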



Two-dimensional range counting has been solved in $O(\lg n / \lg\lg n)$ time and
$n \lg n + o(n \lg n)$ bits of space by Bose et al. [2009]. Unsurprisingly, this time is
also optimal within $O(n\,\mathrm{polylog}(n))$ space [Pătraşcu 2007].

Solution 19 (Color Counting) [Gupta et al. 1995; Bose et al. 2009] The problem
can be solved in $O(\lg n / \lg\lg n)$ time and $n \lg n + o(n \lg n)$ bits of space.

We note that the space in this solution does not account for the storage of the
color array $C[1, n]$ itself, but it is additional space (the solution does not need to
access $C$, on the other hand). Gagie et al. [2013] used this same reduction, but
resorting to binary wavelet trees instead of the faster data structure of Bose et al.
In exchange, they managed to reduce the $n \lg n$ bits of this wavelet tree, and also made
the time dependent on the query range (we are ignoring compressibility aspects of
their results).

Solution 20 (Color Counting) [Gagie et al. 2013] The problem can be solved in
$O(\lg(ep - sp + 1))$ time and $n \lg D + o(n \lg D) + O(n)$ bits of space.

Again, this result does not consider (nor needs) the storage of the array $C$ itself.



Sadakane [2007] showed that computing the document frequency is much simpler
than the general color counting problem. He showed that one can store $2n + o(n)$
bits associated to the suffix tree of $\mathcal{D}$ so that $df(P)$ can be computed in constant
time once the suffix tree node of $P$ is known.

The idea is as follows. For a node $v$ of the suffix tree, let $tf(v, d) = tf(str(v), d)$
for short. Thus, if $v$ is the locus of pattern $P$, it holds $tf(P, d) = tf(v, d)$. Now
let us define $u(v) = \sum_{d,\, tf(v,d) > 0} (tf(v, d) - 1)$. It is not hard to see that
$occ(P, T) = \sum_{d,\, tf(P,d) > 0} tf(P, d) = df(P) + u(v)$. Thus we can easily compute
$df(P) = occ(P, T) - u(v)$ in a suffix tree where $u(v)$ is computed for all the nodes [Hui 1992].



Sadakane shows how to store this information compactly as follows. Value $u(v)$ is
also the number of times the LCA of two consecutive leaves of $T_d$ descends from $v$ in
the GST, for any $d$. Thus, if we define $h(w)$ as the number of times the LCA of two
such consecutive leaves is exactly $w$, it holds $u(v) = \sum_{w \text{ descends from } v} h(w)$. By
storing the $h(w)$ values in preorder, we can compute $u(v)$ by adding up all the $h(w)$
values in a contiguous range. Furthermore, one can see that each pair of consecutive
leaves of the same document adds 1 to some $h(w)$ value, and thus $\sum_w h(w) =
n - D < n$. This allows us to store all the $h(w)$ values in unary, that is, concatenating
$1^{h(w)}0$, in a bitmap $H[1, 2n]$, so that if $p$ is the preorder of node $v$ and $s$ is the number
of nodes in its subtree, we have $u(v) = select_0(H, p + s - 1) - select_0(H, p - 1) - s$.
The suffix tree topology can also be represented with $O(n)$ bits so that the required
operations on the suffix tree, including finding the node $v$ that covers the suffix
array interval $[sp, ep]$, can be carried out in constant time, as discussed earlier.

Example. Fig. 9 illustrates the data structure on our example GST. We show $u$ and
$h$ besides each internal node, and the bitmap representation $H$ on the bottom. For
example, "ma", with locus $v$, occurs in $df(\text{"ma"}) = occ(\text{"ma"}, T) - u(v) = 4 - 1 = 3$
distinct documents, where $occ(\text{"ma"}, T)$ is just the number of leaves below $v$.
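The unary encoding of the $h(w)$ values and the recovery of $u(v)$ with two $select_0$ operations can be checked on a toy tree. In the sketch below the tree shape and the $h$ values are invented for illustration (they are not those of Fig. 9), `select0` is a linear scan standing in for a constant-time structure, and the function names are hypothetical.

```python
def build_H(children, h, root):
    """Return (H, pre, size): the unary bitmap, and the preorder rank and subtree size of each node."""
    H, pre, size = [], {}, {}

    def visit(v):
        pre[v] = len(pre) + 1                 # 1-based preorder rank
        H.extend([1] * h[v] + [0])            # h(v) in unary: 1^{h(v)} 0
        size[v] = 1 + sum(visit(c) for c in children[v])
        return size[v]

    visit(root)
    return H, pre, size


def select0(H, j):
    """1-based position of the j-th 0 in H, with select0(H, 0) = 0 (linear scan for clarity)."""
    if j == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(H, start=1):
        if bit == 0:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("H has fewer than j zeros")


def u(H, p, s):
    """Sum of h(w) over the s nodes whose preorder ranks are p, ..., p + s - 1."""
    return select0(H, p + s - 1) - select0(H, p - 1) - s


# Hypothetical tree: r has children a and b; a has children c and d.
children = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
h = {"r": 1, "a": 2, "b": 0, "c": 0, "d": 1}
H, pre, size = build_H(children, h, "r")
print(H)                                      # [1, 0, 1, 1, 0, 0, 1, 0, 0]
print(u(H, pre["a"], size["a"]))              # 3 = h(a) + h(c) + h(d)
```

With $u(v)$ available this way, $df(P)$ follows as $occ(P, T) - u(v)$, where $occ(P, T) = ep - sp + 1$.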

On top of any CSA, this yields the following solution. 
Solution 21 (Document Frequency) [Sadakane 2007] The problem can be solved
in $O(t_{search}(m))$ time using $|CSA| + O(n)$ bits of space.

8. MOST IMPORTANT DOCUMENT RETRIEVAL 

We move on now from the problem of listing all the documents where a pattern
occurs to that of listing only the $k$ most important ones. In the simplest scenario,
the documents are assigned a fixed importance, as in the top-$k$ most important
documents problem defined earlier. By means
of the document array, this can be recast into the following problem on colors.

Problem 14 (Top-k Heaviest Colors) Preprocess an array $C[1, n]$ of colors in
$[1, D]$ with weights in $W[1, D]$ so that, given a range $[sp, ep]$ and a threshold $k$, we
can output $k$ distinct colors with highest weight in $C[sp, ep]$.

Fig. 9. The data structure to compute document frequencies (only H and the suffix tree topology
are represented) on our running example GST.



A first observation is that, if we reorder the colors so that their weights $W$ become
decreasing, $W[d] \ge W[d + 1]$, then the problem becomes that of reporting the $k$
colors with lowest identifiers in $C[sp, ep]$. This is indeed achieved by Gagie et al.
[2009] using consecutive range quantile queries, as explained (recall Solution 17),
and it is easy to adapt the subsequent improvement [Gagie et al. 2012] to stop
after the first $k$ documents are reported. It is not hard to infer the following result,
where the wavelet tree can also reproduce any cell of $C$ (and thus replace $C$, if we
accept its access time).

Solution 22 (Top-k Heaviest Colors) [Gagie et al. 2012] The problem can be
solved in $O(k \lg(D/k))$ time and $n \lg D + o(n \lg D)$ bits of space. Within this space
we can access any cell of $C$ in time $O(\lg D)$.

Note that this assumes that we can freely reorder the colors. If this is not the case
we need another $D \lg D$ bits to store the permutation. On the other hand, by spending
linear space ($O(n \lg n)$ bits), we can use the improved range quantile algorithms of
Brodal et al. [2011]. In this case, however, an optimal-time solution by Karpinski
and Nekrich [2011] is preferable.

Solution 23 (Top-k Heaviest Colors) [Karpinski and Nekrich 2011] The problem
can be solved in real time and $O(n \lg D)$ bits of space.

As a consequence, we obtain the following results for document retrieval prob- 
lems. The first uses a suffix tree, whereas the second considers CSAs. 



Solution 24 (Top-k Most Important Documents) [Karpinski and Nekrich 2011]
The problem can be solved in $O(m + k)$ time and $O(n \lg n)$ bits of space.

Solution 25 (Top-k Most Important Documents) [Gagie et al. 2012] The problem
can be solved in time $O(t_{search}(m) + k \lg(D/k))$ and $|CSA| + n \lg D + o(n \lg D)$
bits of space, where CSA is a CSA indexing $\mathcal{D}$.

We defer the results using compressed space to the end of Section 10 (Solution 31).

9. TOP-K DOCUMENT RETRIEVAL IN LINEAR SPACE



Hon et al. [2009] introduced a fundamental framework to solve the more general
variant of top-$k$ document retrieval. Basically all of the subsequent work can be
regarded as improvements that build on their basic concepts (see Hon et al. [2010a]
for prior work).

Their basic construction enhances the GST of the collection $\mathcal{D}$ so that the local
suffix tree of each document $T_d$ is embedded into the global GST. More precisely,
let $u$ be a node of the suffix tree of document $T_d$, and let $w$ be its parent in that suffix tree.
Further, let $occ(u) = tf(u, d)$ be the number of leaves below $u$ in the suffix tree of
$T_d$. There must exist nodes $u'$ and $w'$ in the GST of $\mathcal{D}$ with the same string labels,
$str(u') = str(u)$ and $str(w') = str(w)$. Then we record a pointer labeled $d$ from
$u'$ to $w'$, with weight $occ(u)$. Thus the set of parent pointers for each document $d$
forms a subgraph of the GST that is isomorphic to the suffix tree of $T_d$. From the
fact that the pointers labeled $d$ correspond to an embedding of the suffix tree of $T_d$
into the GST, the following crucial lemma follows easily.



Lemma 2 [Hon et al. 2009] Let $v$ be a nonroot node in the GST of $\mathcal{D}$. For each
document $d$ where $str(v)$ occurs $tf(v, d) > 0$ times, there exists exactly one pointer
from the subtree of $v$ (including $v$) to a proper ancestor of $v$, labeled $d$ and with
weight $tf(v, d)$.

To see this, note that if $str(v)$ occurs in $T_d$, there must be a node $u$ of the suffix
tree of $T_d$ mapped to $u'$ in the GST, which is below $v$ (or is $v$ itself). If we follow the
successive upward pointers from $u'$, we go over $v$ at some point,* so there must be
at least one such pointer from the subtree of $v$ to a proper ancestor of $v$. On the
other hand, there cannot be two such pointers leaving from $u''$ and $u'''$, because the
LCA of both nodes must also be in the suffix tree of $T_d$, and hence $u''$ and $u'''$ must
point to this LCA or below it. But since $u''$ and $u'''$ descend from $v$, their LCA
also descends from $v$, so their pointers point at or below $v$. Notice, moreover, that
the source $u'$ of this unique pointer has a weight $tf(u, d)$, which must be $tf(v, d)$,
since all the occurrences of $v$ in document $d$ are in nodes below $u'$ in the GST.



Example. Fig. 10 illustrates how the suffix tree of document $T_1$ is embedded in
the GST, in our running example. The upward pointers describe the topology of
the local suffix tree. Note that for all the patterns that appear in $T_1$, such as "ma",
"mi ma", and "ma ma", but not "la" nor "me", there is exactly one upward pointer
leaving from the subtree of the locus and arriving at an ancestor of the locus.



Hon et al. [2009] store the pointers at their target nodes, $w'$. Thus, given a
pattern $P$, we find its locus $v$ in the GST, and then the ancestors $v_1, v_2, \ldots$ of $v$
record exactly one pointer labeled $d$ per document where $P$ appears, together with
the weight $tf(P, d)$. However, we only want those pointers that originate in nodes
$u'$ within the subtree of $v$. For this sake, the pointers arriving at each node $w'$
are stored in preorder of the originating nodes $u'$, and thus those starting from
descendants $u'$ of $v$ form a contiguous range in the target nodes $w'$. Furthermore,
we build RMQ data structures (this time choosing maxima, not minima) on the
$tf(P, d)$ values associated to the pointers. Fig. 11 (left) illustrates the scheme.
The node $v$ has at most $m$ ancestors. In principle we have to binary search those
$m$ arrays of pointers to isolate the ranges of the pointers leaving from the subtree of
$v$. A technical improvement reduces the time to find those ranges from $O(m \lg D)$
to $O(m)$. Let $v_i[sp_i, ep_i]$ be those ranges. Then, with an RMQ on each such
interval we obtain the positions $p_i$ where the maximum weight in each such array
occurs. Those $m$ weights are inserted into a max-priority queue bounded to size $k$
(i.e., the $(k+1)$th and lower weights are always discarded). Now we extract the
maximum from the queue, which is the top-1 answer. We go back to its interval
$v_i[sp_i, ep_i]$ and cut it into two, $v_i[sp_i, p_i - 1]$ and $v_i[p_i + 1, ep_i]$, compute their RMQ
positions, and reinsert them in the queue. After repeating this process $k$ times, we
have obtained the top-$k$ documents.

*Since $T_d \neq \varepsilon$, there are at least two distinct symbols in $T_d\$$, and thus its root is mapped to the
root of the GST. So we always cross $v$ at some point.

Fig. 10. The embedding of the suffix tree of $T_1$ in the global GST. The nodes are shown grayed
and the upward pointers as thick arrows.

Fig. 11. The linear-space schemes for top-k documents. On the left, the GST with the locus node
shadowed. From the arrays of its ancestors (targets of pointers) we choose the areas (shadowed)
of those leaving from the subtree of the locus. We draw the pointers that go to the grandparent of
the locus. Solid arrows leave from the subtree of the locus and dashed ones from other descendants
of the grandparent node. On the right, the mapping of those arrays into a grid, where the row is
the depth of the target and the column is the preorder of the source.

Fig. 12. The locus of P = "ma" and the area of the list that corresponds to it.



Solution 26 (Top-k Documents) [Hon et al. 2009] The problem can be solved
in $O(m + k \lg k)$ time and $O(n \lg n)$ bits of space.
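The queue-driven extraction just described can be sketched as follows. The sketch assumes the ranges $v_i[sp_i, ep_i]$ have already been isolated, uses a linear scan in place of constant-time RMQs, and does not bound the queue to size $k$; the function name `top_k` and the sample data are hypothetical. Indices are 0-based here.

```python
import heapq


def top_k(arrays, ranges, k):
    """arrays[i]: pointers stored at the i-th ancestor of the locus, as (tf, d) pairs
    sorted by preorder of their source nodes.
    ranges[i] = (sp_i, ep_i): inclusive range of the pointers leaving the locus subtree.
    Returns up to k pairs (d, tf) in decreasing order of tf."""

    def rmq(i, lo, hi):
        # Position of the maximum weight in arrays[i][lo..hi]; stand-in for an O(1) RMQ.
        return max(range(lo, hi + 1), key=lambda p: arrays[i][p][0])

    heap = []

    def push(i, lo, hi):
        if lo <= hi:                               # ignore empty intervals
            p = rmq(i, lo, hi)
            heapq.heappush(heap, (-arrays[i][p][0], i, lo, p, hi))

    for i, (lo, hi) in enumerate(ranges):
        push(i, lo, hi)

    result = []
    while heap and len(result) < k:
        neg_tf, i, lo, p, hi = heapq.heappop(heap)
        result.append((arrays[i][p][1], -neg_tf))
        push(i, lo, p - 1)                         # cut the interval in two around p
        push(i, p + 1, hi)
    return result


arrays = [
    [(1, "a"), (5, "b"), (2, "c"), (7, "d")],      # pointers arriving at ancestor v_1
    [(3, "e"), (9, "f"), (4, "g")],                # pointers arriving at ancestor v_2
]
ranges = [(1, 3), (0, 1)]
print(top_k(arrays, ranges, 3))                    # [('f', 9), ('d', 7), ('b', 5)]
```

By Lemma 2 each document contributes exactly one pointer to the selected ranges, so no document is reported twice. Since results are produced one at a time in decreasing weight, the loop can also be stopped early.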



Example. Fig. 12 shows the arrays of target nodes, in the format $(d, tf(u', d))$,
sorted by preorder of the source nodes but omitting the preorder information for
clarity. We shadow the locus node $v$ of $P$ = "ma". In this small example the
locus has only one proper ancestor, the root $v_1$. The range $v_1[sp_1, ep_1] = v_1[7, 9]$
in that array corresponds to the pointers leaving from the subtree of $v$. An RMQ
structure over this array lets us find in constant time the top-1 answer, $v_1[7] = (d =
1, tf(P, d) = 2)$. To get the top-2 answer we split the interval into $v_1[7, 6]$ (empty)
and $v_1[8, 9]$, and pick the largest of the two.

Note that the weights $tf(P, d)$ can be replaced by any other measure that is a
function of the locus of $P$ in the suffix tree of $T_d$. This includes some as sophisticated
as $dmin(P, d)$, the minimum distance between any two occurrences of $P$ in $d$.



Navarro and Nekrich [2012] improved the space and time of this solution by using
a different way of storing the pointers. They consider a grid of $O(n) \times O(n)$ points so
that a pointer from node $u'$ to node $w'$ is stored as a point $(depth(w'), preorder(u'))$
in this grid, associated to the document $d$ and with weight $tf(u', d)$. Then, once the
locus $v$ of $P$ is found in the suffix tree, the problem is reduced to that of finding
the $k$ heaviest points in the range $[0, depth(v) - 1] \times [preorder(v), preorder(v) +
subtreesize(v) - 1]$ on the grid (note that there may be several points in a single
place of this grid, which can be dealt with by creating unique columns for them).
This is, again, a geometric problem. We illustrate it in Fig. 11 (right).

Fig. 13. The grid representation of the pointers and the query.

The key to achieving optimal time is to note that the height of the query range
is at most $m$, and we have already spent $O(m)$ time to find the pattern. Navarro
and Nekrich [2012] show that, if we can spend time proportional to the row-size of
the query range, then it is possible to report each top-$k$ point in constant time. In
addition, they manage to slightly reduce the space.

Solution 27 (Top-k Documents) [Navarro and Nekrich 2012] The problem can
be solved in $O(m + k)$ time and $O(n(\lg D + \lg \sigma))$ bits of space.

Other weights than $tf(P, d)$ can also be used in this solution, yet the space
becomes $O(n(\lg D + \lg \sigma + \lg\lg n))$ bits.

Example. Fig. 13 illustrates Navarro and Nekrich's representation of the pointers
on a grid. In our example the grid is just of height two, and there is no more than
one pointer per cell. The query area is shaded.

We note that Solutions 26 and 27 are online, that is, we do not need to specify $k$
in advance, but can run the algorithms and stop them at any time, after reporting
$k$ documents. Interestingly, the optimal online solution immediately yields optimal
solutions to various outstanding challenges left open by Muthukrishnan [2002].
First, if we want to list all the documents where a pattern appears more than $t$
times, we simply run the online algorithm until it outputs the first document $d$ with
$tf(P, d) \le t$. Second, if we want to list all the documents where two occurrences
of the pattern appear within distance at most $t$, we do the same with the $dmin$
weighting function. Muthukrishnan [2002] had obtained sophisticated solutions for
those problems, which were optimal-time and linear-space only for $t$ fixed at index
construction time. Now we can solve those problems as particular cases, in linear
space and optimal time, with the ability to define $t$ at query time (and even in
online form).

We note, however, that Muthukrishnan's solutions work for general color range
problems, not only for document retrieval. This opens the question of whether we
can also solve the corresponding top-$k$ problem on colors.



Problem 15 (Top-k Colors) Preprocess an array $C[1, n]$ of colors in $[1, D]$ so
that, given a range $[sp, ep]$ and a threshold $k$, we can output $k$ colors with highest
frequency in $C[sp, ep]$.

As far as we know, there exists only an approximate solution to this problem
[Gagie et al. 2013]. This means that, given an error threshold $\varepsilon$, one ensures that
no ignored color occurs more than $1 + \varepsilon$ times as often as a reported color.

Fig. 14. The process of solving top-k colors.

Solution 28 (Top-k Colors) [Gagie et al. 2013] A $(1 + \varepsilon)$-approximation for the
problem requires $O(k \lg D \lg(1/\varepsilon))$ time and $O((n/\varepsilon) \lg D \lg n)$ bits of space.

The technique used is interesting by itself. It composes an approximate solution
for top-1 color in $O(\lg(1/\varepsilon))$ time and $O((n/\varepsilon) \lg n)$ bits of space [Greve et al. 2010]
with a wavelet tree on $C$, so that one such structure is stored for each sequence $S_v$
associated to wavelet tree node $v$. For the top-1 answer, say $d_1$, it is sufficient to
query the root. For the top-2 answer, $d_2$, we must exclude $d_1$ from the solution.
This is done by considering all the siblings of the nodes in the path from the root
to the leaf $d_1$ of the wavelet tree. The maximum over all those top-1 queries gives
the top-2. For the top-3, we must repeat the process to exclude also $d_2$ from the
set, and so on. We maintain a priority queue with all the wavelet tree nodes that
are candidates for the next answer. After returning $k$ answers, the queue has $k \lg D$
candidates (as the root-to-leaf paths are of length $\lg D$). By using an appropriate
priority queue implementation, the result is obtained.



Example. Fig. 14 illustrates the process. On the left, we query C[4, 13], finding 
on the root that 4 is the most frequent color. To find the top-2 result, we remove 4 
from the alphabet by partitioning the root node (which handles symbols [1,4]) into 
two wavelet tree nodes: that handling [1,2] and that handling [3] (on the right). 
Now we perform the query on both arrays, finding that 2 is the most frequent color 
from both nodes. If we wanted the top-3, it would be decided between wavelet tree 
leaves handling [1] and [3]. 

This result is not fully satisfactory for two reasons. First, the space is superlinear.
Second, it is approximate. The second limitation, however, seems to be
intrinsic. It has been shown that it is unlikely, even for $k = 1$, to obtain times below
$\Omega(n^{\omega/2 - 1})$ using space below $O(n^{\omega/2})$, where $\omega$ is the matrix multiplication exponent (best
known value is $\omega = 2.376$). This is a case where document retrieval queries, which
translate into specific colored range queries (where the possible queries come from
a tree), are much easier than general colored range queries.

10. TOP-K DOCUMENT RETRIEVAL IN COMPRESSED SPACE 

Just as it happened for the optimal document listing solution, the optimal solution
for top-$k$ documents may use too much space on large document collections, so one
seeks to reduce its space even at the price of a suboptimal query time.

In their seminal paper, Hon et al. [2009] also presented the first compressed
solution to the problem. The idea is to sample some suffix tree nodes and store the
top-$k$ answer for those sampled nodes. The sampling mechanism guarantees that
this answer must be corrected with just a small number of suffix tree leaves in order
to solve any query.

Assume for a moment that $k$ is fixed and let $b = k \lg^{2+\varepsilon} n$. We choose the GST
leaves whose suffix array position is a multiple of $b$, and mark the LCA of each
pair of consecutive chosen leaves. This guarantees that the LCA of any two chosen
leaves is also marked.* At each of the $n/b$ marked internal nodes $v$, we store the
result $top(\mathcal{D}, str(v), k)$. This requires $O(k \lg n)$ bits per sampled node, which adds
up to $O(n / \lg^{1+\varepsilon} n)$ bits. The subgraph of the GST formed by the marked nodes
(preserving ancestorship) is called $\tau_k$.

Instead of storing the GST, we store the trees $\tau_k$, for all $k$ values that are powers
of 2. All the $\tau_k$ trees add up to $O(n / \lg^{\varepsilon} n) = o(n)$ bits. Given a top-$k$ query, we
solve it in tree $\tau_{k'}$, where $k' = 2^{\lceil \lg k \rceil} = O(k)$ is the power of 2 next to $k$.

Just as the pattern $P$ has a locus node $v$ in the GST, it has a locus $v'$ in $\tau_{k'}$,
using the same Definition 4. It is not hard to see that, since the nodes of $\tau_{k'}$ are
a subset of those of the GST, $v'$ must descend from $v$ (or be the same). A way to
find $v'$ is to use a CSA and obtain the range $[sp, ep]$ of $P$, then restrict it to the
closest multiples of $b' = k' \lg^{2+\varepsilon} n$, $[sp'', ep''] \subseteq [sp, ep]$, then take the $(sp''/b')$th and
$(ep''/b')$th leaves of $\tau_{k'}$, and finally take $v'$ as the LCA of those two leaves. The
following property is crucial.

Lemma 3 [Hon et al. 2009] The node $v' \in \tau_{k'}$ covers a range $[sp', ep']$ such that
$[sp'', ep''] \subseteq [sp', ep'] \subseteq [sp, ep]$.

It holds that $[sp'', ep''] \subseteq [sp', ep']$ because $v'$ is the LCA of leaves $sp''$ and $ep''$ in
the GST, and it holds that $[sp', ep'] \subseteq [sp, ep]$ because $v$ is the LCA of $[sp, ep]$ and
it is an ancestor of $v'$. Moreover, note that $sp' - sp < b$ and $ep - ep' < b$.

Node $v'$ has precomputed the top-$k$ answer for $[sp', ep']$ (moreover, the top-$k'$
answer). We only need to correct this list with the documents that are mentioned
in $A[sp, sp' - 1]$ and $A[ep' + 1, ep]$, and those ranges are shorter than $b$. It is also
possible that $[sp'', ep''] = \emptyset$, but in this case it holds that $[sp, ep]$ is shorter than $2b$
and thus the answer can be computed from scratch by examining those $O(b)$ cells.
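The arithmetic behind $k'$ and $[sp'', ep'']$ is simple enough to try out. The sketch below only rounds the query range to the sampled positions; finding $v'$ itself needs the tree $\tau_{k'}$ and is not modeled. The function names and the block length used in the calls are illustrative assumptions.

```python
def next_power_of_two(k):
    """k' = 2^ceil(lg k), the smallest power of 2 that is >= k (for k >= 1)."""
    return 1 << (k - 1).bit_length()


def sampled_subrange(sp, ep, b):
    """Return (sp'', ep''), the first and last multiples of b inside [sp, ep],
    or None when the range contains no sampled position."""
    sp2 = -(-sp // b) * b          # smallest multiple of b that is >= sp
    ep2 = (ep // b) * b            # largest multiple of b that is <= ep
    return (sp2, ep2) if sp2 <= ep2 else None


print(next_power_of_two(5))            # 8
print(sampled_subrange(13, 47, 10))    # (20, 40): tails [13, 19] and [41, 47] are shorter than b
print(sampled_subrange(11, 19, 10))    # None: the whole range is scanned directly
```

Since $sp'' - sp < b$ and $ep - ep'' < b$, and Lemma 3 places $[sp', ep']$ between the two ranges, the cells left to scan on each side are fewer than $b$.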



Fig. 15 illustrates the scheme. The locus v is in gray and the marked node v' in 
black. The areas that must be traversed sequentially are in bold. 

In the general case, we have up to $k$ precomputed candidates and must correct
the answer with $O(b)$ suffix array cells. We traverse those cells one by one and
compute the corresponding document $d$ as in Solution 13. Now,
each such document $d$ may occur many more times in $C[sp', ep']$, yet be excluded
from the top-$k$ precomputed list. Therefore, we need a mechanism to compute its
frequency in $C[sp, ep]$. Note that we cannot use the technique used in Solution 13
because we do not have access to the first and last occurrence of $d$ in $C[sp, ep]$.

We take advantage of the fact that we do have access to either the first (if we
are scanning $[sp, sp' - 1]$) or the last (if we are scanning $[ep' + 1, ep]$) position of
$d$ in $C[sp, ep]$. Say it is its first position, $i$. We map this position to $A_d[i']$, and
carry out an exponential search for the largest position $j' \ge i'$ such that $A_d[j']$
is mapped back to some $A[j]$ with $j \le ep$. When we find such $j'$, we know that
$tf(P, d) = j' - i' + 1$ and can consider including $d$ in the top-$k$ candidate list (note $d$
might already be in the list, in which case we have to update its original frequency).
This costs $O(t_{SA} \lg n)$ per processed cell, for a total cost of $O(b\, t_{SA} \lg n)$ time to solve
the query. By using a compressed bitmap representation, as discussed earlier, for the bitmap
$B$ that marks the document beginnings, we obtain the following result.

*We are presenting the marking scheme as simplified by Navarro and Valenzuela [2012], where
this property is proved.

Fig. 15. The compressed top-k retrieval scheme.

Solution 29 (Top-k Documents) [Hon et al. 2009] The problem can be solved
in time $O(t_{search}(m) + k\, t_{SA} \lg^{3+\varepsilon} n)$ and $2|CSA| + D \lg(n/D) + O(D) + o(n)$ bits of
space, where CSA is a CSA indexing $\mathcal{D}$, for any constant $\varepsilon > 0$.
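The exponential search for $j'$ can be illustrated in isolation. In the sketch below, the list `global_pos` is a hypothetical stand-in for the mapping from positions of $A_d$ to positions of the global suffix array $A$ (which is increasing in $j'$); in the real structure each probe of that mapping costs time on the CSAs, whereas here it is a list lookup. Indices are 0-based.

```python
def tf_from_first(global_pos, i1, ep):
    """global_pos[j1] is the position in the global suffix array A that the j1-th suffix
    of document d maps to (increasing in j1). i1 is the local index of the first
    occurrence of d in A[sp, ep]. Returns tf = j1 - i1 + 1 for the largest j1 with
    global_pos[j1] <= ep, using O(lg tf) probes."""
    n = len(global_pos)
    lo, step = i1, 1                       # invariant: global_pos[lo] <= ep
    while lo + step < n and global_pos[lo + step] <= ep:
        lo += step                         # gallop forward with doubling steps
        step *= 2
    hi = min(lo + step, n)                 # global_pos[hi] > ep, or hi == n
    while hi - lo > 1:                     # binary search inside (lo, hi)
        mid = (lo + hi) // 2
        if global_pos[mid] <= ep:
            lo = mid
        else:
            hi = mid
    return lo - i1 + 1


global_pos = [2, 5, 9, 12, 13, 20, 31]     # made-up positions of d's suffixes in A
print(tf_from_first(global_pos, 1, 13))    # 4: positions 5, 9, 12, 13 fall within the range
```

The symmetric case, where the last occurrence is known, searches backwards for the smallest local index whose global position is at least $sp$.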

Note that the space simplifies to $2|CSA| + o(n)$ if $D = o(n)$. There have been
several technical improvements over this idea [Gagie et al. 2013; Belazzougui et al.
2013]. The best current results, however, have required deeper improvements.
One remarkable idea arose when extending the mmphfs of an earlier solution to top-$k$
retrieval. The idea of Belazzougui et al. [2013] is that, if a document $d$ occurs in the
left ($[sp, sp' - 1]$) and the right ($[ep' + 1, ep]$) tails of the interval, then we know its
first and last occurrence in $C[sp, ep]$, and thus the mmphfs can be used to compute
$tf(P, d)$ fast. The problem lies in the documents $d$ that appear only in one of the tails.
Belazzougui et al. [2013] prove, however, that there can be only $k + \sqrt{2bk}$ elements
of this kind that can make it to the top-$k$ list.

To see this, let $f_{min}$ be the $k$th frequency in the top-$k$ stored set. Then all the
other documents have frequency $\le f_{min}$. The first $k$ documents of the tail can
immediately enter the list, if they now reach frequency $f_{min} + 1$. However, the
next documents to enter the top-$k$ list must now reach frequency $f_{min} + 2$, and
thus we need to scan at least $2k$ cells of the tail to complete the next batch of $k$
candidates. Similarly, the next $k$ candidates require scanning at least $3k$ cells to
reach frequency $f_{min} + 3$, and so on. To incorporate $sk$ elements we need to scan
$\Omega(s^2 k)$ cells. Since we scan at most $b$ cells, the bound $O(\sqrt{bk})$ follows.
With some care, the frequency of all those potential candidates can be stored
as well (for example, their frequency must be in the narrow range $[f_{min} - b +
1, f_{min}]$, and instead of storing the document identifiers $d$ we can mark one of their
occurrences in $[sp, sp' - 1]$, of length at most $b$; we can then obtain $d$ with the CSA).
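The counting argument behind the $\sqrt{2bk}$ term can be written out explicitly. The following is only a worked version of the reasoning above, with $s$ denoting the number of batches of $k$ candidates admitted from one tail:

```latex
\begin{align*}
\text{cells scanned for } s \text{ batches}
  &\ge \sum_{j=1}^{s} jk \;=\; \frac{k\,s(s+1)}{2} \;\ge\; \frac{s^2 k}{2}, \\
\frac{s^2 k}{2} \le b
  &\;\Longrightarrow\; s \le \sqrt{2b/k}
  \;\Longrightarrow\; sk \le \sqrt{2bk}.
\end{align*}
```

Thus at most $\sqrt{2bk} = O(\sqrt{bk})$ documents can be admitted through batches of increasing frequency when at most $b$ cells are scanned.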



Very recently, Tsur [2013] improved this result further, by noticing that the
idea of limiting the number of candidates can be extended to the case where the
document appears in both tails. This is because the only interesting ranges are those
that correspond to GST nodes, and the leaves covered by the successive unmarked
ancestors of $v'$ (until reaching the nearest marked ancestor) form $O(b)$ increasing
sets of leaves, so the reasoning of Belazzougui et al. [2013] applies verbatim.

The surprising result is that all the possible candidates for the nonmarked nodes,
and their frequencies, can be precomputed and stored. There is no need at all to
use mmphfs (nor local CSAs or document arrays) to solve a top-$k$ query! This
yields the first result for this problem with essentially optimal space, and moreover
very competitive time.



Solution 30 (Top-k Documents) [Tsur 2013] The problem can be solved in time
$O(t_{search}(m) + k(t_{SA} + \lg k + \lg\lg n) \lg k \lg n (\lg\lg n)^4)$ and $|CSA| + D \lg(n/D) + O(D) +
o(n)$ bits of space, where CSA is a CSA indexing $\mathcal{D}$.

Assuming we use a CSA with $t_{SA} = O(\lg n)$, the time simplifies to $O(t_{search}(m) +
k \lg k \lg^{2+\varepsilon} n)$ for any constant $\varepsilon > 0$. Assuming $D = o(n)$, the space simplifies to the
asymptotically optimal $|CSA| + o(n)$. This space had not been even achieved for document listing!



On the other hand, Belazzougui et al. [2013] show that these ideas can be applied
to solve the top-$k$ most important documents problem in compressed space. In this case they
sort the document identifiers by decreasing weight, and each marked node in $\tau_k$
stores simply the $k$ smallest document identifiers in the range. There is no need
of the individual CSAs to compute term frequencies. Further, they speed up the
traversal of the blocks of size $O(b)$ by subsampling them and creating minitrees
$\tau_{k'}$ inside each block. Therefore, instead of collecting the candidates from at most
one tree and traversing two blocks, we must collect the candidates from at most
one tree, two minitrees, and two subblocks, the latter being sequentially traversed.
Those minitrees store the top-$k'$ answers for selected nodes, just as the global one,
with the difference that instead of a document identifier, they store a CSA position
inside the block where such document appears, as before. This allows them to
encode each document identifier in $O(\lg\lg n)$ bits and thus use smaller miniblocks.

Solution 31 (Top-k Most Important Documents) [Belazzougui et al. 2013]
The problem can be solved in time $O(t_{\mathrm{search}}(m) + k\,t_{SA} \lg k \lg^{\epsilon} n)$ and
$|CSA| + D\lg(n/D) + O(D) + o(n)$ bits of space, where CSA is a CSA indexing $\mathcal{D}$,
for any constant $\epsilon > 0$.

With the above assumptions on $t_{SA}$ and $D$, this simplifies to
$O(t_{\mathrm{search}}(m) + k \lg k \lg^{1+\epsilon} n)$ time and $|CSA| + o(n)$ bits. The space is
asymptotically optimal and the time is close to that for document listing (Solution 11).



Very recently, Hon et al. [2013] translated this result back into top-k (most
frequent) document retrieval. The solution of Belazzougui et al. [2013] does not work
for this problem because one cannot easily compose two (or, in this case, three)
partial top-k most frequent document answers into the answer of the union (as a
global top-k answer might not be a top-k answer in any of the sets). This worked
for the top-k most important document problem, which is easily decomposable.
However, Hon et al. [2013] consider the trees and the minitrees in a slightly
different way. There are two kinds of trees, the original ones, $\tau_k$, and a new set of
(also global) trees, $\rho_k$. This new set of trees uses shorter blocks, of length $c < b$.
For each node $v \in \rho_k$, we consider the highest node $u \in \tau_k$ that descends from $v$
(there is at most one such highest node, because $\tau_k$ is LCA-closed). Further, the
area covered by $v$ is wider than that of $u$ by at most $b$ leaves on each extreme. For
$v$ they store only the top-k answers that are not already mentioned in the top-k
answers of $u$. Those answers must necessarily appear at least once outside the
area of $u$, and thus they can be encoded, similarly to Belazzougui et al. [2013], as
an offset of $O(\lg\lg n)$ bits. Their frequency is not stored, but computed using
individual CSAs as in Solution 29. Then a top-k query requires collecting results
from one $\tau_k$ tree node, from one $\rho_k$ tree node, plus two traversals over $O(c)$ leaves.
Overall, they obtain the same result of Belazzougui et al. [2013], but now for the
more difficult top-k most frequent document retrieval.



Solution 32 (Top-k Documents) [Hon et al. 2013] The problem can be solved
in time $O(t_{\mathrm{search}}(m) + k\,t_{SA} \lg k \lg^{\epsilon} n)$ and $2|CSA| + D\lg(n/D) + O(D) + o(n)$ bits
of space, where CSA is a CSA indexing $\mathcal{D}$, for any constant $\epsilon > 0$.

Again, with the above assumptions on $t_{SA}$ and $D$, this simplifies to
$O(t_{\mathrm{search}}(m) + k \lg k \lg^{1+\epsilon} n)$ time and $2|CSA| + o(n)$ bits. This is the best time
that has been achieved for top-k retrieval when using this space. It is natural to ask
whether it is possible to obtain the optimal space of Solution 30 combined with the
time of this solution, and moreover whether the ideal time
$O(t_{\mathrm{search}}(m) + k\,t_{SA})$ can be achieved.

There have been, on the other hand, much faster solutions using the $n \lg D$ bits
of a document array [Gagie et al. 2013; Belazzougui et al. 2013]. The best solution
from this family [Hon et al. 2012a] is of a completely different nature: they start
from the linear-space solution of Hon et al. [2009] and carefully encode the various
components. Their result is significantly faster than any of the other schemes. (We
omit an even faster variant that uses $2n \lg D$ bits.)



Solution 33 (Top-k Documents) [Hon et al. 2012a] The problem can be solved
in time $O(t_{\mathrm{search}}(m) + (\lg\lg n)^{O(1)} + k\,(\lg\sigma \lg\lg n)^{1+\epsilon})$ and
$|CSA| + n\lg D + o(n\lg D)$ bits of space, where CSA is a CSA indexing $\mathcal{D}$, for any
constant $\epsilon > 0$.

10.1 An Optimal-Space Solution to Document Listing 

After finding, in Section 5, that achieving the optimal $|CSA| + o(n)$ bits of space
was an open problem, it is surprising that this is possible for the top-k problems,
which seem harder in principle. In particular, using Solution 31 with sufficiently
large k should return all the distinct documents where P appears. We are now
in a position to close the problem we opened at the end of Section 5, achieving
asymptotically optimal space for document listing. More surprisingly, we will show
how to compute the term frequencies of the documents listed, still within optimal
space, as promised at the end of Section 6.

We build on the ideas for compressed top-k documents to develop a new (and
now correct) solution for document listing within essentially optimal space. Our
solution is slower than the (incorrect) one of Hon et al. [2009], but in exchange it
outputs the term frequencies of the documents reported.



Solution 34 (Document Listing with Frequencies) The problem can be solved
in time $O(t_{\mathrm{search}}(m) + \mathrm{docc}\; t_{SA} \lg \mathrm{docc} \lg^{1+\epsilon} n)$ and
$|CSA| + D\lg(n/D) + O(D) + o(n)$ bits of space, where CSA is a CSA indexing $\mathcal{D}$,
for any constant $\epsilon > 0$.

Very roughly, the idea is to use Solution 29 and ask for the top-k documents in
the range, for successive powers of 2 for k, until all the documents in the range are
returned. However, a few refinements are necessary. We give the details now.

Let $b_k$ be the block size for tree $\tau_k$, for $k = 1, 2, 4, \ldots, D$. After determining
$[sp, ep]$ in $O(t_{\mathrm{search}}(m))$ time, we obtain the corresponding locus of P in the
successive trees $\tau_k$, each in constant time using an LCA operation, as explained, on the
maximum $b_k$-aligned range $[sp', ep']$ contained in $[sp, ep]$. We continue until the
locus for some $\tau_k$ stores fewer than k candidates, which means that there are fewer
than k distinct documents in the range, or until reaching $k = D$. At that point, all
the distinct documents in $[sp', ep']$ are in the candidate list, with their frequencies
already computed. We only have to traverse the $O(b_k)$ elements of A that are in
$[sp, ep] \setminus [sp', ep']$ and add them to the result. To manage the set of candidates,
insert the new elements that appear, increase the counters of the elements visited,
and finally collect them all for listing, we store the document identifiers in a
y-fast trie [Willard 1983], which offers $O(\lg\lg D)$ amortized time per insertion and
worst-case time for queries. This time will be absorbed by a $\lg^{\epsilon} n$ factor.

The bulk of the space comes from storing the list of candidates with their
frequencies, which requires $\mathrm{docc} \lg n \le k \lg n$ bits per list. Let $b_k = k\,\ell_k$. Then,
since tree $\tau_k$ has $O(n/b_k)$ nodes, the sum of the space for the lists stored in it
is $O((n/b_k)(k \lg n)) = O((n/\ell_k) \lg n)$. By choosing $\ell_k = \lg k \lg^{1+\epsilon} n$, the space
for $\tau_k$ is $O(n/(\lg k \lg^{\epsilon} n))$, which summed up over all the trees $\tau_k$, for $k = 2^i$, is
$O(n/\lg^{\epsilon} n) \cdot \sum_{i=1}^{\lg D} 1/i = O(n \lg\lg D/\lg^{\epsilon} n) = o(n)$. The topology of each tree $\tau_k$
requires $O((n/b_k) \lg n)$ further bits, which is asymptotically irrelevant.
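
The summation in the space bound can be spelled out step by step. The following is only a restatement of the argument above under its own choices $\ell_k = \lg k \lg^{1+\epsilon} n$ and $k = 2^i$ (so that $\lg k = i$); the index range $1 \le i \le \lg D$ is an assumption of this sketch.

```latex
\sum_{i=1}^{\lg D} O\!\left(\frac{n}{\ell_{2^i}}\,\lg n\right)
  = \sum_{i=1}^{\lg D} O\!\left(\frac{n \lg n}{i\,\lg^{1+\epsilon} n}\right)
  = O\!\left(\frac{n}{\lg^{\epsilon} n}\right) \sum_{i=1}^{\lg D} \frac{1}{i}
  = O\!\left(\frac{n \lg\lg D}{\lg^{\epsilon} n}\right)
  = o(n)
```

The third step uses that the harmonic sum up to $\lg D$ is $O(\lg\lg D)$, and the last one that $\lg\lg D \le \lg\lg n = o(\lg^{\epsilon} n)$ for any constant $\epsilon > 0$.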

If we try the successive powers of 2 for k, we will find a tree node $v' \in \tau_k$ with the
frequencies of all the $\mathrm{docc}' \le \mathrm{docc}$ distinct documents in $C[sp', ep']$. Since this was
not the case for $\tau_{k/2}$, it follows that $k/2 \le \mathrm{docc}$, and thus $k = O(\mathrm{docc})$. Therefore,
we will traverse $O(b_k)$ cells, where $b_k = O(k \lg k \lg^{1+\epsilon} n) = O(\mathrm{docc} \lg \mathrm{docc} \lg^{1+\epsilon} n)$.
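
The doubling strategy can be illustrated with a minimal Python sketch. It is not the compressed structure itself: `top_k_oracle` is a hypothetical brute-force stand-in for the precomputed candidate lists of the trees $\tau_k$, and the names `list_with_frequencies` and `probes` are ours. The sketch only shows why probing $k = 1, 2, 4, \ldots$ stops with $k = O(\mathrm{docc})$.

```python
from collections import Counter

def top_k_oracle(C, sp, ep, k):
    """Hypothetical stand-in for a top-k index: up to k (document, frequency)
    pairs for C[sp..ep] (inclusive), most frequent first."""
    return Counter(C[sp:ep + 1]).most_common(k)

def list_with_frequencies(C, sp, ep, D):
    """List all distinct documents in C[sp..ep] with their frequencies by
    probing k = 1, 2, 4, ... until fewer than k answers return or k >= D."""
    k, probes = 1, 0
    while True:
        answers = top_k_oracle(C, sp, ep, k)
        probes += 1
        if len(answers) < k or k >= D:
            return dict(answers), probes
        k *= 2

# Toy document array over D = 4 documents (an assumption for illustration).
C = [0, 1, 0, 2, 0, 1, 3, 0, 2, 2]
freqs, probes = list_with_frequencies(C, 0, 2, D=4)
```

On the range `C[0..2]`, which holds two distinct documents, the loop stops at $k = 4$, that is, after three probes, in line with the bound $k/2 \le \mathrm{docc}$.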

If we do not need to output (nor store) the term frequencies, it is possible to
obtain a better result by using the same technique of probing successive powers of
2 for k, but changing our encoding above by the one used by Belazzougui et al.
[2013] for top-k most important documents (Solution 31).



Solution 35 (Document Listing) The problem can be solved in time
$O(t_{\mathrm{search}}(m) + \mathrm{docc}\; t_{SA} \lg \mathrm{docc} \lg^{\epsilon} n)$ and $|CSA| + D\lg(n/D) + O(D) + o(n)$ bits of
space, where CSA is a CSA indexing $\mathcal{D}$, for any constant $\epsilon > 0$.

11. PRACTICAL DEVELOPMENTS

Many of the theoretical developments we have described are simple enough to be
implementable, in some cases after some algorithm engineering tuning. In this
section we describe the results obtained by the implementations we are aware of.

11.1 Document Listing with Term Frequencies 



There are few doubts that Solution 11 is a good practical method for plain document
listing. The situation, however, is not so clear for Solution 13, which gives the term
frequencies of the listed documents. This solution was independently implemented
on two occasions [Culpepper et al. 2010; Navarro et al. 2011] and found in both cases
to use much more space than expected. This is probably because the individual
structures $CSA_d$ pose a constant-space overhead that is significant for relatively
small documents (see a discussion of implementation issues of CSAs in a previous
survey [Ferragina et al. 2009]). While this solution is expected to be slower than
using an explicit document array, the fact that it also uses more space might be an
artifact that can be solved with a more careful implementation that represents all
the $CSA_d$ arrays as a single structure.

Välimäki and Mäkinen [2007] presented the first experimental results showing
that, as expected, the document array was a fast but space-consuming structure
for this problem (Solution 12). Culpepper et al. [2010] implemented the
quantile-based listing algorithm (Solution 17), showing its superiority in time and space over
Solutions 13 and 12, yet space was still an issue. They also considered a baseline
solution based on storing inverted indexes for q-grams, which was also inferior.
Navarro et al. [2011] introduced a technique to compress the wavelet tree of the
document array. The idea is based on an observation by González and Navarro
[2007], who compressed suffix arrays by differentially encoding and then
grammar-compressing them. The reason why this works has deep roots (see a thorough
discussion in Navarro and Mäkinen [2007]), but it can be summarized as: repetitive
texts induce long areas in the suffix array that are identical to other areas, yet
with the values shifted by 1. In a document array, most of those areas become just
plain repetitions, as the suffix shifted by 1 usually falls in the same document [Gagie
et al. 2013]. However, just grammar-compressing the document array is not enough,
because one needs rank operations on the sequence in order to run the document
listing algorithm on it. Navarro et al. [2011] grammar-compressed the bitmaps
$B_v$ stored at the wavelet tree nodes, when this was convenient (they showed that
the repetitions at the root level faded away at deeper levels). They also replaced
the quantile-based Solution 17 by the always faster and simpler Solution 18. As a
result, they obtained document listing solutions that were about twice as slow and
required about half the space, when the text collection was compressible.

Finally, Belazzougui et al. [2013] implemented Solution 14, based on mmphfs.
Although it lost to the compressed wavelet trees on compressible texts, it offered a
more stable performance, using less space than uncompressed wavelet trees for all
the texts, regardless of their compressibility. Its time performance, however, is
significantly worse.

To give some rough numbers on a typical desktop computer (those are still
unpublished results, newer than those in Navarro and Valenzuela [2012]), in the collections
tested the uncompressed document array requires 13-18 bits per character (bpc),
and solves document listing queries of extent $ep - sp$ = 10 to 100,000 in 5-50
milliseconds (msec). The compressed document arrays require 8-12 bpc and 10-50
msec per query. The solution based on mmphfs always requires 7-13 bpc and
around 60 msec. The implementation using individual CSAs requires around 60
bpc and 500 msec. Those spaces are in addition to the global CSA, which takes
4-8 bpc depending on the compressibility of the collection. It is likely that an
implementation of Solution 34 will require less space than all these solutions, yet
probably it will be significantly slower than using wavelet trees.

11.2 Top-k Documents



The most interesting ideas of Culpepper et al. [2010] were two heuristics for top-k
document retrieval that used just the plain document array. The best was a
prioritized document listing that took advantage of the fact that, as we backtrack in the
wavelet tree, we can know the sums of the tf(P, d) values over all the documents d
represented in each wavelet tree node (this is just the size of the interval [sp, ep] on the
node). In the traversal they gave priority to the nodes with higher $ep - sp + 1$. They
kept in a priority queue the wavelet tree nodes about to be traversed (initially just
the root). Then they extracted the node with largest $ep - sp + 1$ value. If it was
a leaf, they reported the corresponding document. Otherwise they inserted both
children in the queue. Then the first k leaves extracted were the answer. Later,
Culpepper et al. [2012] showed that the idea works well at a much larger scale and
even competes, up to some degree, with inverted-index-based top-k solutions, on
natural language text collections. See also the preceding work by Patil et al. [2011].
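
The prioritized traversal just described can be sketched in a few lines of Python. This is a toy illustration, not the implementation of Culpepper et al.: the wavelet tree stores plain prefix counts instead of succinct bitmaps, ranges are half-open `[s, e)`, document identifiers are assumed to be integers in `[0, D)`, and the names `WaveletNode` and `top_k` are ours.

```python
import heapq
from itertools import count

class WaveletNode:
    """Pointer-based wavelet tree node over document ids in [lo, hi)."""
    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if hi - lo == 1:
            return  # leaf: represents the single document lo
        mid = (lo + hi) // 2
        # ones[i] = number of symbols >= mid among the first i positions
        self.ones = [0]
        for x in seq:
            self.ones.append(self.ones[-1] + (x >= mid))
        self.left = WaveletNode([x for x in seq if x < mid], lo, mid)
        self.right = WaveletNode([x for x in seq if x >= mid], mid, hi)

def top_k(root, s, e, k):
    """Return up to k (document, frequency) pairs for positions [s, e),
    always expanding the node whose mapped interval is largest."""
    result, tie = [], count()
    heap = [(-(e - s), next(tie), root, s, e)] if e > s else []
    while heap and len(result) < k:
        _, _, node, s, e = heapq.heappop(heap)
        if node.hi - node.lo == 1:
            result.append((node.lo, e - s))  # interval size = term frequency
            continue
        s1, e1 = node.ones[s], node.ones[e]  # interval in the right child
        s0, e0 = s - s1, e - e1              # interval in the left child
        if e0 > s0:
            heapq.heappush(heap, (-(e0 - s0), next(tie), node.left, s0, e0))
        if e1 > s1:
            heapq.heappush(heap, (-(e1 - s1), next(tie), node.right, s1, e1))
    return result

C = [0, 1, 0, 2, 0, 1, 3, 0, 2, 2]   # toy document array, D = 4
root = WaveletNode(C, 0, 4)
best = top_k(root, 0, len(C), 2)
```

Because the interval size of a node is the sum of the frequencies below it, a leaf can only be extracted once no pending node could hide a more frequent document, so leaves come out in nonincreasing frequency order. The same loop can be turned into a generator when k is not known in advance.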



The first implementation of the succinct top-k framework of Hon et al. [2009]
was by Navarro and Valenzuela [2012]. They focused on Solution 29, yet using
a document array instead of the individual CSAs (which, as explained, did not
work well in practice). They made several optimizations, in particular replacing
the brute-force scanning of the blocks by an adaptation of the prioritized document
listing of Culpepper et al. [2010]. This traversal is stopped as soon as it delivers
documents with frequencies below that of the kth candidate already known.

Another optimization was to factor out the redundancies that arise because the
same node may store the top-k answers in some $\tau_k$ and then the top-2k answers in
$\tau_{2k}$. They store all the top-k precomputed solutions in $\tau_1$ (which contains the other
trees) for the maximum k where each node stores answers. The other trees store
only their topology plus pointers to $\tau_1$. They use a simpler marking method that
in particular does not store suffix tree leaves in the $\tau_k$ trees, only their LCAs. This
reduces the space to about one half, but forbids using LCAs to find the locus node
of P in $\tau_k$. The locus is instead found by traversing the nodes of $\tau_k$ from the root
and binary searching the children according to the interval they cover (looking for
[sp, ep]). This is slower compared to an LCA but its impact is minimal.

The experiments compare all these solutions, including several variants to store
the $\tau_k$ trees and representations of the document array (including one based on
Solution 4, where only the basic brute-force block traversal algorithm can be
implemented). The results show that adding the data structures of Hon et al. [2009]
imposes very little extra space and considerably improves upon the time performance
of the basic heuristics [Culpepper et al. 2010]. They also show that the
improvements make a significant difference with respect to brute-force scanning, even when
this is implemented over the faster sequence representation of Solution 4. The different
variants of wavelet trees, compressed or not, dominate the time/space map.

To give some rough numbers on a typical desktop machine (again from unpublished
results, newer than those in Navarro and Valenzuela [2012]), a top-k query takes
about k msec using 8-14 bpc with compressed wavelet trees on compressible texts,
or 0.02k-0.1k msec using 10-18 bpc with plain wavelet trees, on any text. The
basic heuristics without the structures $\tau_k$ use almost the same space, but require
more time (in some cases very little, in other cases up to 10 times more). The
brute-force solution on the fast sequence implementation requires as much as 18-24
bpc and 0.01k-0.6k msec. Again, in all those solutions we have to add about 4-8
bpc for the global CSA.

Recently, some attempts at directly implementing the linear space solutions have
been made. Konow and Navarro [2013] implemented Solution 27, replacing the
complex optimal-time algorithm on the grid by the simpler solution for two-dimensional
top-k retrieval of Navarro et al. [2013]. The fact that the height of a suffix tree is
$O(\lg n)$ with high probability (whp) on a wide variety of assumptions on the text
lets them state an attractive time complexity of $O(m + (k + \lg\lg n) \lg\lg n)$ (whp).
They also improve the space by removing from the grid the points with term
frequency 1 (this idea was also mentioned by Hon et al. [2012a]). If the grid returns
fewer than k answers, a normal document listing on the document array is used to
obtain further documents to complete the answer. Overall, their data structure
uses 16-24 bpc and answers queries in 0.001k-0.004k msec. This is generally faster
than solutions based on the compressed framework of Hon et al. [2009], yet the
space is significantly higher.



Another probably practical solution, yet to be implemented [Hon et al. 2012a],
departs from the linear-space solution of Hon et al. [2009]. The results, summarized
in Solution 33, will probably translate directly into a practical result, comparable
to that of Konow and Navarro [2013].

It is interesting to take a look at the direct implementations of other proposals
to see how drastic the above space savings are. For example, Hon et al. [2010a]
implemented a predecessor of the linear-space index of Hon et al. [2009]. While they
obtain times in the range 0.1-0.2 msec for top-10 queries, their index takes about
4,000 bpc (i.e., 250 times the space of the text!). Still, we can probably do better
than now in terms of space, as suggested by the result of Tsur [2013] (Solution 30),
but possibly at the price of significantly higher time, since a CSA is much slower
than a wavelet tree at accessing a cell.



12. CONCLUSIONS 

While document retrieval on several Western natural languages can be handled 
with simple inverted indexes, extending this task to other languages and scenarios 
requires solving document retrieval on general sequence collections. This has proven 
to be much more algorithmically challenging and has stimulated a fair amount of
research in the last decade. Many of those problems can be reduced to "range
color" problems, which have many additional applications in data mining. 

Table 2 gives the time/space complexities achieved for the different range color
problems we have considered in this survey. While the solutions to the variants
of color listing are rather satisfactory, those for color counting and top-k heaviest
colors have good time complexities, but the space is linear. Worse, there exist only
approximate solutions to top-k colors (and this seems to be intrinsic), which in
addition require superlinear space.

Problem                               Sol   Time                Space
Color Listing                         10    Real                Data + O(n)
Color Listing with Frequencies        15    Optimal             Data + O(n lg lg D + D lg n)
Color Counting                        19    lg n / lg lg n      n lg n + o(n lg n)
                                            lg(ep - sp + 1)     n lg D + o(n lg D) + O(n)
Top-k Heaviest Colors                 23    Real                O(n lg D)
                                      32                        n lg D + o(n lg D) *
Top-k Colors ((1+ε)-approximation)    28    k lg D lg(1/ε)      O((n/ε) lg D lg n)

Table 2. Best time/space complexities achieved for range color problems. The asterisk means
that the structure contains sufficient information to reproduce any array cell in O(lg D) time.

Table 3 gives simplified time/space complexities for the document retrieval
problems we have considered. Each category is ordered by decreasing space
(considering, when they are not directly comparable, the most common case). There
are optimal-time solutions to all the problems, yet requiring linear space (except,
notably, computing document frequency). On the other hand, there are
asymptotically space-optimal solutions in all cases, adding just $o(n)$ bits on top of a CSA
(except, again, document frequency, which adds $O(n)$). Some of those solutions
have emerged in this survey. The space-optimal solutions require, however,
polylogarithmic time. There are various tradeoffs in between, using more space and less
time. It is unknown which are the best times that can be achieved in optimal space.

On the practical side, all the successfully implemented solutions have used the
document array represented with wavelet trees. For document listing with frequencies,
these solutions require about 3 times the size of the collection, and replace it. The
most space-efficient solutions may reach, on compressible collections, as little as
12 bpc, and require a few tens of milliseconds to run a query. The more
space-demanding solutions (and all the solutions on incompressible collections) take as
much as 26 bpc and solve the queries in a few milliseconds. For top-k queries the
most compressed solutions use about 1 millisecond per delivered document, whereas
the most space-demanding ones take 1-4 microseconds.

In general, the current solutions give satisfactory time performance, but their
space requirements are still too high. This is in sharp contrast with the
pattern matching problem, which the CSAs solve efficiently within optimal space (i.e.,
the entropy of the collection) and in addition replace the collection [Navarro and
Mäkinen 2007; Ferragina et al. 2009]. While it is possible that the solutions using in
theory the space of two CSAs can be successfully implemented, we believe that the
recent optimal-space solutions (i.e., that of Tsur [2013] and the new ones proposed
in this survey) should be the basis of the next generation of practical document
retrieval indexes. Still, their theoretical time complexities suggest that a good deal
of algorithm engineering will be necessary to render their times acceptable.

Problem                           Sol   Time                  Space
                                  1                           O(n lg n)
                                  15
                                  13
                                                              |CSA| + o(n)
Document Frequency                21                          |CSA| + O(n)
Top-k Most Important Documents    24                          O(n lg n)
                                  35                          |CSA| + n lg D
                                  31                          |CSA| + o(n)
Top-k Documents                   27                          O(n lg D + n lg σ)
                                  33                          |CSA| + n lg D
                                  32    lg k lg^{1+ε} n       2|CSA| + o(n)
                                  30    lg k lg^{2+ε} n       |CSA| + o(n)

Table 3. Best time/space complexities achieved for document retrieval problems. We assume, to
simplify, that $t_{SA} = \lg^{1+\epsilon} n$ for some constant $\epsilon > 0$, that $D = o(n)$, and that all times are in
addition to $t_{\mathrm{search}}(m)$ (this is just m when using $O(n \lg n)$ bits of space) and per element output (the
last line has an additive term of $(\lg\lg n)^{O(1)}$). We also write space $n \lg D$ for $n \lg D + o(n \lg D)$.

In this survey we have focused on a small subset of fundamental color range and 
document retrieval problems. However, there are more complex and challenging 
ones. Arguably, the most important extension left out is to consider queries formed 
by more than one pattern string. The document listing problem is then extended 
to listing those documents where some of the patterns appear (union queries), or 
all of the patterns appear (intersection queries) or, generalizing, where at least t of 
the patterns appear (thresholded queries). The corresponding top-k problems aim
at finding the k highest weighted documents among those that are listed, where 
the weight involves contributions from all the patterns appearing in the document. 
One can define the corresponding color listing problems as well. 

Those problems are, of course, well known in the inverted index literature, where
most of the research focuses on intersections (see Barbay et al. [2009] for a recent
survey). There is a difficulty measure of the problem (i.e., a lower bound) that applies
to the corresponding problem of intersecting (inverted) lists: the number of jumps
from one list to another when reading the documents from all the lists in increasing
order. Gagie et al. [2013] consider the intersection problem on color arrays (and
hence on document arrays) represented as wavelet trees, and show that they can
achieve a complexity close to that lower bound. With respect to top-k algorithms
on inverted indexes, those for intersections usually carry out a boolean intersection
first and then filter the highest frequencies, so they can be extended to document
retrieval using Gagie et al.'s solution. The algorithms for top-k unions generally
process the documents in decreasing frequency order. It is not difficult to adapt the
top-k document retrieval techniques we have covered to retrieve the results online,
that is, without knowing k a priori. This lets them simulate the sequential access
to an inverted list where the documents are listed in decreasing frequency order,
and thus any algorithm on inverted lists can be simulated on top of these iterators.

A more theoretical line of research aims at obtaining time complexities that are
independent of the inverted lists. Ferragina et al. [2003] show that the problem
for two patterns of length $O(m)$ can be solved using $O(n^{3/2} \lg n)$ words of space
and $O(m + \sqrt{n} + \mathrm{docc})$ time. This space is impractical in most cases. It was
improved to $O(n \lg n)$ words by Cohen and Porat [2010], although the time rose
to $O(m + \sqrt{n\,\mathrm{docc}}\,\lg^{5/2} n)$. Then it was further improved [Hon et al. 2010b] to
linear space and $O(m + \sqrt{n\,\mathrm{docc}\,\lg n})$ time, by extending the succinct framework
(Solution 26) so that all pairs of marked nodes are preprocessed (thus the nodes
must be much larger now, hence the complexity). Hon et al. [2010b] also generalize
the solution to handle top-k retrieval (where docc becomes k in the time) and to
handle up to t patterns, for t fixed at indexing time. Here the time worsens to
$O(mt + n^{1-1/t}\,\mathrm{docc}^{1/t}\,\mathrm{polylog}(n))$, which quickly tends to linear time. These results
suggest that this problem is significantly more difficult than those involving just
one pattern.

Muthukrishnan [2002] shows that listing the documents where a pattern does
not appear can be done in real time and linear space, yet his solution does not
seem amenable to the space-reduction techniques that have been developed for
the simple document listing problem. In any case, this query is most interesting
when combined with other patterns that must appear in the documents. This was
addressed by Fischer et al. [2012] and improved by Hon et al. [2012b], who achieve
linear space and $O(m + \sqrt{n \lg\lg n} + \sqrt{n\,\mathrm{docc}}\,\lg^{5/2} n)$ time to handle one "positive"
and one "negative" pattern.

Although there has been already a fair amount of theoretical and practical re- 
sults on the simpler problems, this is a young and vibrant research area, where we 
are seeing just the first results on those more complex queries. There have also 
been only very preliminary results on other important issues such as dynamism, 
parallelism, and secondary memory (see, e.g., Shah et al. [2012]). We expect this
area to evolve in the next years into a more mature field where the current research 
will turn into practical systems. 

To conclude, an interesting question is whether these more general data structures 
will ever be able to compete with inverted indexes in the very same niche the latter 
have been designed for. While we do not expect inverted indexes to be overthrown 
in typical natural language queries, we do believe the suffix-array-based solutions for 
general sequences will prove more efficient in handling more sophisticated queries, 
where the simple bag-of-words model is insufficient. This includes features that are
becoming standard in search engines, such as phrase patterns [Patil et al. 2011;
Fariña et al. 2012; Culpepper et al. 2012], approximate pattern matching [Navarro
et al. 2001; Celikik and Bast 2009], and autocompletion queries [Bast et al. 2008],
to name a few. The given references show either that suffix-array-based solutions
are superior to inverted indexes, or that the latter have to be significantly extended
to cope with those types of queries.

APPENDIX

A. RANGE MINIMUM QUERIES AND LOWEST COMMON ANCESTORS 

The RMQ problem has a fascinating history, intimately related to the LCA problem.
Harel and Tarjan [1984] showed that the LCA problem could be solved in constant
time after a linear-time preprocessing, but their algorithm was quite complicated.
Schieber and Vishkin [1988] managed to simplify the algorithm somewhat while
retaining its optimality. Berkman and Vishkin [1993] showed that, if one traverses
the tree in preorder and writes down the depths of the touched nodes in an array,
then the LCA problem on the tree becomes an RMQ problem on the array of depths
(one also needs a pointer from each tree node to its first occurrence in the array).
This RMQ problem is particular because consecutive array entries differ by ±1, and
this was exploited by Berkman and Vishkin to solve it. Gabow et al. [1984] showed
that, in turn, a general RMQ problem could be converted into an LCA problem,
by considering the Cartesian tree [Vuillemin 1980] on the array (see its definition
in the body of the survey). The observation Gabow et al. made is that $RMQ_L(i, j)$
corresponds to the LCA of the nodes corresponding to array positions i and j of the
Cartesian tree of L.
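
The correspondence between RMQs and LCAs on the Cartesian tree can be checked on a small example. The Python sketch below is only illustrative: it builds the tree with the usual stack-based linear-time construction, and answers LCA queries by naively walking parent pointers rather than with any of the constant-time structures discussed here. The function names and the sample array are ours.

```python
def cartesian_tree(L):
    """Parent array of the (min-rooted) Cartesian tree of L; the root has
    parent -1. Ties are broken so that the leftmost minimum is the ancestor."""
    parent = [-1] * len(L)
    stack = []  # positions on the rightmost path, values nondecreasing
    for i, x in enumerate(L):
        last = -1
        while stack and L[stack[-1]] > x:
            last = stack.pop()
        if last != -1:
            parent[last] = i        # popped chain becomes the left subtree of i
        if stack:
            parent[i] = stack[-1]   # i becomes the right child of the new top
        stack.append(i)
    return parent

def lca(parent, i, j):
    """Naive lowest common ancestor by walking parent pointers."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:
        j = parent[j]
    return j

L = [3, 1, 4, 1, 5, 9, 2, 6]
parent = cartesian_tree(L)
pos = lca(parent, 4, 7)   # position of the minimum of L[4..7]
```

Here `pos` is 6, the position of the value 2, which is indeed the minimum of `L[4..7]`.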



Finally, Bender and Farach-Colton [2000] simplified the methods up to a point
that the solution was considered to be practical: To solve the general RMQ problem,
one builds the Cartesian tree and solves the LCA problem on it instead. To solve
that LCA problem, one traverses the tree and converts it into a restricted RMQ
(±1) problem. To solve that problem, one precomputes solutions to all intervals
whose length is a power of 2, $RMQ_L(i, i + 2^{\ell} - 1)$, so that any interval $L[sp, ep]$ is
covered by two such (overlapping) intervals, $[sp, sp + 2^{\ell} - 1]$ and $[ep - 2^{\ell} + 1, ep]$,
for $\ell = \lfloor \lg(ep - sp + 1) \rfloor$. To avoid using $O(n \lg n)$ words of space with all those
precomputed intervals, one precomputes only the intervals starting at multiples
of $\frac{1}{2} \lg n$. Queries then cover a number of whole intervals plus two partial ones
at the tails. For the part covered by whole intervals we have two candidates to
be the answer, as explained. For the tails, since this is a restricted (±1) RMQ
problem, there are only $\sqrt{n}$ possible distinct intervals, and thus a precomputed
table of sublinear size can answer RMQ queries on all the possible ranges within
any possible interval. We then compare the (up to) 4 candidate cells in L and
return the minimum. This explains Solution 6.
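
The first ingredient above, precomputing the answers for all intervals whose length is a power of 2 and covering a query with two overlapping ones, can be sketched as follows. This Python fragment deliberately implements only that $O(n \lg n)$-word table; the block sampling at multiples of $\frac{1}{2} \lg n$ and the lookup table for the ±1 tails are omitted. The names `build_sparse_table` and `rmq` are ours.

```python
def build_sparse_table(L):
    """table[l][i] = position of the leftmost minimum of L[i .. i + 2^l - 1]."""
    n = len(L)
    table = [list(range(n))]
    l = 1
    while (1 << l) <= n:
        prev, half = table[-1], 1 << (l - 1)
        row = []
        for i in range(n - (1 << l) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if L[a] <= L[b] else b)
        table.append(row)
        l += 1
    return table

def rmq(L, table, sp, ep):
    """Position of a minimum of L[sp..ep] (inclusive), in constant time."""
    l = (ep - sp + 1).bit_length() - 1          # floor(lg(ep - sp + 1))
    a = table[l][sp]                            # covers [sp, sp + 2^l - 1]
    b = table[l][ep - (1 << l) + 1]             # covers [ep - 2^l + 1, ep]
    return a if L[a] <= L[b] else b

L = [3, 1, 4, 1, 5, 9, 2, 6]
table = build_sparse_table(L)
p = rmq(L, table, 2, 5)   # minimum of L[2..5] = [4, 1, 5, 9]
```

The two intervals may overlap, which is harmless for minima; this is why the same trick does not directly apply to non-idempotent operations such as sums.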

A new twist to the problem was given by Fischer and Heun 20111, who aimed 



at solving the RMQ problem without accessing the array L. They showed that it is 
sufficient to store 2n + o{n) bits in order to answer RMQ queries in constant time. 
The idea is to start with the Cartesian tree and convert it into a general tree by 
the well-known isomorphism: We create a special root for the general tree, and the 
nodes in the leftmost branch of the binary tree are the children of the root. Their 
right subtrees are recursively converted into general trees. It can then be proved 
that RMQi(i, j) is obtained by first computing the LCA of the ith and jth nodes 
(in preorder) of this general tree, and then taking the child of this node in the path 
to node j. Various compact representations of general trees can carry out those 
operations in constant time and $2n + o(n)$ bits in total (Solution 7). The reader
may have noticed that one has to solve the LCA problem for those compact tree 
representations, by resorting at the end of the day to the other RMQ algorithms 
we have described. Yet, we have not accessed L at all! 

Furthermore, by noting that each Cartesian tree yields different RMQ answers for
some query, and that there are about $4^n/n^{3/2}$ different binary trees on n nodes, one
can see that $\lg(4^n/n^{3/2}) = 2n - O(\lg n)$ bits are necessary to distinguish between
all the possible arrays L. This explains Solution 7.
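The counting step can be written out explicitly. The sketch below assumes the standard Catalan-number count of binary trees and Stirling's approximation, neither of which is spelled out in the text; the symbol $T_n$ is introduced here only for illustration.

```latex
\[
  T_n \;=\; \frac{1}{n+1}\binom{2n}{n} \;\sim\; \frac{4^n}{\sqrt{\pi}\, n^{3/2}},
  \qquad
  \lg T_n \;=\; 2n - \tfrac{3}{2}\lg n - O(1) \;=\; 2n - O(\lg n).
\]
```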

B. PROOF OF CORRECTNESS OF SADAKANE'S ALGORITHM 



The algorithm of Sadakane [2007] is a variant of that of Muthukrishnan [2002]
that, instead of continuing iff L[p] < sp, continues iff V[C[p]] = 0 (and if so it sets
$V[C[p]] \leftarrow 1$). While Muthukrishnan [2002] gives a detailed proof of the correctness
of his algorithm, the variant of Sadakane [2007] is not clearly proved correct. Since,
as we have shown, this is a delicate issue, we prove here that Sadakane's algorithm
is correct. More precisely, we prove it is correct when it visits the left subtree first.
Both algorithms are described in detail in Section 4, where it is shown that
Sadakane's algorithm may fail if visiting the right subtree first.

It is useful to realize that both algorithms reconstruct the top part of the Cartesian
tree of L[sp, ep]. Muthukrishnan's is known to reconstruct precisely the nodes
p where L[p] < sp (by the definition of L, these are the leftmost occurrences of
each document), and we prove now that Sadakane's algorithm does the same. Let
us call CT the part of the Cartesian tree reconstructed by the algorithms.



We assume Muthukrishnan's algorithm also visits the left subtree first (although
in its case this does not make a difference), and prove by induction that both
algorithms process the same intervals [i, j], in the same order, and perform the
same action. Both start with [sp, ep], compute the position $p = RMQ_L(sp, ep)$,
and split the interval at p, because there must be some value smaller than sp in
[sp, ep], and thus L[p] < sp (Muthukrishnan's algorithm), and because bitmap V is
all zeroed in the beginning (Sadakane's algorithm).



Now consider a general interval [i, j], where by the inductive hypothesis both
algorithms have performed exactly the same steps until now. Both algorithms will
compute $p = RMQ_L(i, j)$. Now there are two cases:



—L[p] < sp. Then Muthukrishnan's algorithm will report document C[p], which
means it is the first time it sees this document. By the inductive hypothesis,
Sadakane's algorithm has visited the same documents until now, thus it is also the
first time it sees document C[p]. Hence it holds V[C[p]] = 0, and the algorithm
also reports C[p]. Then both split the interval at p and continue processing it.

—$L[p] \ge sp$. This means that the first occurrence of C[p] in [sp, ep] is to the left of
p. Moreover, it is to the left of i (otherwise the RMQ operation would have given
that position instead of p). Since Muthukrishnan's algorithm finds the leftmost
occurrence of each document in the interval, and we assume it reconstructs CT
in left-to-right preorder, any node to the left of i must have already been
reconstructed, because it will not be visited later. Then, Muthukrishnan's algorithm
has already output C[p] and, by the inductive hypothesis, Sadakane's algorithm
has already output C[p] as well. Thus V[C[p]] = 1, and then both algorithms
terminate the recursion in this node.
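The two procedures compared in this proof can be sketched side by side. This is a minimal illustration under several assumptions: positions are 0-based (so "no previous occurrence" is encoded as -1), the RMQ is a naive linear scan standing in for a constant-time structure, the bitmap V is modeled as a set, and all function names are hypothetical. Note that the stand-in RMQ reads L, whereas the point of Sadakane's variant is to avoid storing L.

```python
def build_L(C):
    """L[p] = previous position holding document C[p], or -1 if there is none."""
    last, L = {}, []
    for p, d in enumerate(C):
        L.append(last.get(d, -1))
        last[d] = p
    return L


def argmin(L, i, j):
    """Naive stand-in for RMQ_L(i, j): leftmost minimum of L[i..j]."""
    return min(range(i, j + 1), key=L.__getitem__)


def list_muthukrishnan(C, L, sp, ep):
    out = []

    def visit(i, j):
        if i > j:
            return
        p = argmin(L, i, j)
        if L[p] >= sp:              # C[p] already occurs in [sp, p - 1]: stop
            return
        out.append(C[p])
        visit(i, p - 1)             # left subtree first
        visit(p + 1, j)

    visit(sp, ep)
    return out


def list_sadakane(C, L, sp, ep):
    out, V = [], set()              # V models the bitmap, initially all zeros

    def visit(i, j):
        if i > j:
            return
        p = argmin(L, i, j)
        if C[p] in V:               # V[C[p]] = 1: document already reported
            return
        V.add(C[p])                 # V[C[p]] <- 1
        out.append(C[p])
        visit(i, p - 1)             # left subtree first, as the proof assumes
        visit(p + 1, j)

    visit(sp, ep)
    return out
```

On any document array C and range [sp, ep], the two functions should return the same documents in the same order, which is what the induction above establishes.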

C. CONSTANT-TIME ARRAY INITIALIZATION IN LINEAR-BIT SPACE 

The classical solution to this problem uses linear space, that is, $O(D \lg D)$ extra
bits on top of the array (see Ex. 2.12, page 71 in Aho et al. [1974] or a complete
description by Mehlhorn [1984, Sec. III.8.1]). We show here how to implement the
solution using only O(D) bits. To be precise, we are interested in implementing the
following data structure.

Definition 10 (Initializable Array) An initializable array is a data structure
V[1, D] that supports the operations init(V, D, v), read(V, i) and write(V, i, v). The
first operation initializes $V[i] \leftarrow v$ for all $1 \le i \le D$; the second obtains V[i] and
the third sets $V[i] \leftarrow v$.

The classical technique is as follows. We use a second array U[1, D] and a stack
S[1, D], both storing indices in [1, D]. An additional variable $0 \le t \le D$ tells the
current size of S, and a further variable stores the initialization value.

Initialization of the whole structure, init(V, D, v), consists of setting $t \leftarrow 0$ and
recording v as the initialization value. The invariant maintained by the structures
is as follows:

$V[i]$ is initialized $\iff (1 \le U[i] \le t \,\wedge\, S[U[i]] = i)$,

which is immediately correct once we set t = 0. The idea is that an initialized
entry V[i] is recognized because U[i] = j and there is a back pointer S[j] = i. Let us
take an uninitialized entry V[i]. Value U[i] is not initialized either. If U[i] < 1 or
U[i] > t, we know for sure that V[i] is not initialized. Yet it could be that
$1 \le U[i] \le t$. But then it is not possible that S[U[i]] = i, because entry U[i] in S
has been used to initialize another entry V[i'] and then $S[U[i]] = i' \neq i$. Operation
read(V, i) is as follows. If V[i] is initialized, then it returns V[i], otherwise it
returns the initialization value. Operation write(V, i, v) is as follows. If V[i] is not
initialized, it first sets $t \leftarrow t + 1$, $U[i] \leftarrow t$, and $S[t] \leftarrow i$. After the possible
initialization, it sets $V[i] \leftarrow v$. It is interesting that one can even uninitialize
V[i], by setting $S[U[i]] \leftarrow S[t]$, $U[S[t]] \leftarrow U[i]$, $t \leftarrow t - 1$.
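The classical scheme just described can be sketched as follows. Several things here are assumptions made for illustration: the class and method names, 0-based indices (so the validity test becomes 0 <= U[i] < t), and the use of random junk to imitate memory that was never cleared. Python itself allocates the lists in O(D) time, so only `init` reflects the constant-time claim.

```python
import random


class InitializableArray:
    """Classical scheme: array V plus an index array U and a stack S."""

    def __init__(self, D):
        # Random junk simulates uninitialized memory (illustration only).
        self.V = [random.randrange(1 << 30) for _ in range(D)]
        self.U = [random.randrange(-D, 2 * D) for _ in range(D)]
        self.S = [random.randrange(-D, 2 * D) for _ in range(D)]
        self.t = 0                  # current size of the stack S
        self.default = None         # the initialization value

    def init(self, v):
        self.t = 0                  # O(1): every back pointer becomes invalid
        self.default = v

    def _is_init(self, i):
        u = self.U[i]
        return 0 <= u < self.t and self.S[u] == i

    def read(self, i):
        return self.V[i] if self._is_init(i) else self.default

    def write(self, i, v):
        if not self._is_init(i):
            self.U[i] = self.t      # forward pointer into the stack
            self.S[self.t] = i      # back pointer certifying the entry
            self.t += 1
        self.V[i] = v

    def uninitialize(self, i):
        if self._is_init(i):
            last = self.S[self.t - 1]
            self.S[self.U[i]] = last    # move the top of the stack into i's slot
            self.U[last] = self.U[i]
            self.t -= 1
```

After `init(v)`, every `read(i)` returns v regardless of the junk stored in V, U and S, because no entry can pass the back-pointer test while t = 0.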



Fig. 16 (left) illustrates the basic technique. 



Fig. 16. On the left, the classical scheme, tripling the space. On the right, our compact solution 
using 3n extra bits. 



To reduce space we use, instead of array U and stack S, a bit vector B[1, D]
so that B[i] = 1 iff V[i] has been initialized. This way we require only D bits in
addition to V. Of course the problem translates into initializing $B[i] \leftarrow 0$ for all i.
We now take advantage of the RAM model of computation, with word size $w \ge \lg D$.

As B is stored as a contiguous sequence of bits, let us interpret this sequence
as an array B'[1, D'] of $D' = \lceil D/w \rceil$ entries, each entry holding a computer word
of w bits of B. We can apply now the classical solution to B', so that B' can be
initialized in constant time (at value B'[i] = 0). The extra space on top of B' is
$2D' \lg D' \le 2D$ bits. Together with B', the space overhead of the solution is 3D
bits. Now, in order to determine whether V[i] is initialized, we just check B[i]:
We compute q = i div w and r = i mod w and check the bit number r + 1 of
B'[q + 1]. If B'[q + 1] is not yet initialized, we know B[i] = 0 and thus V[i] is not
yet initialized. To initialize V[i] we set the (r + 1)-th bit of B'[q + 1], previously
initializing $B'[q + 1] \leftarrow 0$ if needed. Fig. 16 (right) illustrates our solution.
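The compact scheme can be sketched in the same spirit. This is a minimal illustration, not the authors' implementation: the word size `W = 64`, the class name, and the 0-based indexing (bit r of word q, with q = i div w and r = i mod w) are assumptions, and random junk again imitates uncleared memory. The classical back-pointer test is applied to the words of B' rather than to the entries of V.

```python
import random

W = 64  # assumed word size w


def _junk(count, limit):
    """Random values standing in for uninitialized memory (illustration only)."""
    return [random.randrange(limit) for _ in range(count)]


class CompactInitializableArray:
    def __init__(self, D):
        Dp = (D + W - 1) // W       # D' = ceil(D / w) words of B
        self.V = _junk(D, 1 << 30)
        self.Bp = _junk(Dp, 1 << W) # B': one word per W bits of B
        self.U = _junk(Dp, Dp)      # classical scheme, applied to B'
        self.S = _junk(Dp, Dp)
        self.t = 0
        self.default = None

    def init(self, v):
        self.t = 0                  # O(1): all words of B' become uninitialized
        self.default = v

    def _word_is_init(self, q):
        u = self.U[q]
        return u < self.t and self.S[u] == q

    def _bit(self, i):
        q, r = divmod(i, W)
        # An uninitialized word of B' counts as all zeros.
        return self._word_is_init(q) and (self.Bp[q] >> r) & 1 == 1

    def read(self, i):
        return self.V[i] if self._bit(i) else self.default

    def write(self, i, v):
        q, r = divmod(i, W)
        if not self._word_is_init(q):
            self.U[q] = self.t
            self.S[self.t] = q
            self.t += 1
            self.Bp[q] = 0          # first touch: clear the whole word
        self.Bp[q] |= 1 << r        # mark V[i] as initialized
        self.V[i] = v
```

The arrays U and S now have only D' entries each, which is where the saving over the classical layout comes from.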

We note that this idea can be carried one level further to achieve
$D + o(D)$ extra bits of space, instead of 3D.



